ABSTRACT.

Starting with an exact definition of classes of SIMD (single
instruction, multiple data) systems, a general approach to obtaining
lower time bounds by data flow analysis is presented. Several
interconnection schemes, such as the square net, the perfect shuffle,
the infinite binary tree, etc., are analyzed with respect to
their data transfer possibilities. For some types of computational
problems the data dependencies are analyzed in a quantitative way.
From both types of analysis, lower time bounds result for many
combinations of SIMD systems and computational problems, for example,
$\Omega(\log N)$ for on-line quadtree-net systems and the computation of
Voronoi diagrams for N planar points, $\Omega(N)$ for off-line
diagonal-net systems and the two-dimensional discrete Fourier transform,
and [...] for off- or on-line Illiac-net systems and sorting of N items.
0. Introduction

A  general  approach  to  characterizing  the  inherent  complexity 
of  computational  problems  is  given  by  the  quantitative  analysis 
of  the  extent  of  the  data  flow  that  has  to  be  performed  during 
the  solution  of  these  problems.  On  the  other  hand,  any  parallel 
processing  system  possesses  a  restricted  ability  for  fast  data 
transfer  determined  essentially  by  the  interconnection  pattern  of 
the processing elements. In the present paper, these general
observations, as previously mentioned by GENTLEMAN (1978), SIEGEL
(1979), ABELSON (1980), or KLETTE (1980), will be transformed into
precise definitions of local, global and total data transfer
within  SIMD  systems,  and  the  corresponding  definitions  of  local, 
global  and  total  data  dependencies  for  computational  problems  as 
well.  The  basic  relation  between  these  corresponding  notions  - 
the  computational  time  must  at  least  be  sufficient  for  realizing 
the  necessary  extent  of  data  transfer  -  will  be  represented  in  a 
so-called  data  transfer  lemma  that  outlines  the  starting  point  of 
our  formalized  method  of  obtaining  lower  time  bounds  by  data  flow 
analysis.  This  approach  will  be  illustrated  by  application  to  a 
variety  of  different  parallel  processing  architectures  where  the 
unifying  feature  will  be  that  we  shall  use  SIMD  models  that  employ 
an  interconnection  network  and  use  no  shared  memory.  Our  parallel 
processing  systems  will  be  abstract  models  of  computation  where 
the  level  of  abstraction  may  be  compared  with  that  of  a  random 
access machine (RAM); cf. AHO et al. [2] for this model of serial
computation. The computational problems mentioned in the present
paper were inspired by the digital image processing area; see
ROSENFELD et al. [9] and KLETTE [5]. But, of course, this does not
represent a serious restriction; e.g., matrix multiplication or
pattern matching are computational problems of general importance.

The general SIMD model as used in this paper is characterized
by a finite or infinite set of processing elements (PEs),
an interconnection network, and a central processing unit (CPU).
For a rough scheme of an SIMD system which the reader may have
in mind throughout this paper, see Fig. 1.

CPU. The CPU has a (central) random access memory which
consists of a finite or infinite sequence of registers
$r_0, r_1, r_2, \ldots$ with a distinguished accumulator $r_0$. Let
$D_{CPU}$ be the depth of this random access memory, i.e., the number
of CPU registers, for $1 \le D_{CPU} \le \infty$. Furthermore, let
$W_{CPU}$ be the word length of these registers (number of bit
positions), which is assumed to be constant for all CPU registers, for
$1 \le W_{CPU} \le \infty$. The CPU spreads a single instruction stream
to the synchronized working PEs. The programs of the system are stored
in a special program memory of the CPU, which is potentially unlimited
in size. Part of any instruction addressed to the PEs is an
enable/disable mask to select a subset of the PEs that are to perform
the given instruction; the remaining PEs will be idle. The CPU may read
the accumulator contents of any one PE of a specified subset of all
PEs, and is able to transfer its accumulator contents to some of the PE
accumulators. Any data transfer between CPU and PEs is restricted to
serial mode.


PEs. Each PE has some (local) random access memory which
consists of a finite or infinite sequence of registers
$r_0, r_1, r_2, \ldots$ with a distinguished register $r_0$ called the
accumulator. Let $D_{PE}$ be the depth of these random access memories;
this depth is assumed to be constant for all PEs of a given
system, for $1 \le D_{PE} \le \infty$. Furthermore, let $W_{PE}$ be the
unique word length of the PE registers, for $1 \le W_{PE} \le \infty$.
Each PE is capable of performing some basic operations which take place
in its accumulator. Direct data access is restricted to its own
registers, to the accumulators of the directly connected PEs in the
sense of the given interconnection network, and, possibly, to
the accumulator of the CPU. The PEs are indexed by integers
or tuples of integers. Each PE knows its index. Let $N_{PE}$,
$0 \le N_{PE} \le \infty$, be the number of PEs of a given system, and
$ind = \{j_1, j_2, \ldots, j_{N_{PE}}\}$ be the set of all PE indices of
a given SIMD system.

Interconnection network. Each PE is located in a node of
a given undirected graph representing the two-way interconnection
scheme. Any PE may uniquely identify the different edges
connected to its node by using a given coding scheme. Let $N_{IN}$
be the branching degree of the network, i.e., the maximum degree
of the nodes of the given graph, for $0 \le N_{IN} < \infty$.

For the selection of a specialized SIMD model the following
system features may be concretely specified:

•  off-line or on-line communication with the outside world,

•  special values for $N_{PE}$, $N_{IN}$, $D_{CPU}$, $D_{PE}$,
$W_{CPU}$, $W_{PE}$,

•  the set ind,

•  the interconnection network structure including the
edge coding scheme,

•  the CPU instruction set including the available set of
enable/disable masks as well as the method of data exchange
between CPU and PEs, and

•  the restrictions on the system in communication with
the outside world, i.e., input and output management.

Note that, as regards the technical realization of an SIMD computing
facility, one implementation may in principle offer
different ways to run such a system, i.e., the working principles
of several SIMD models as considered in the present paper
may be unified within one implementation. Essentially, this
is the problem of constructing a flexible, reconfigurable
interconnection network, and/or of running a system in
different modes.

The outline of this paper is as follows. In the first
section we shall present some standardized system description
features for specifications of SIMD models. In Section 2 we
shall describe how the data flow of an SIMD system may be measured
by functions in a quantitative way. Then, in Section 3
the corresponding notions of data dependencies will be explained
for computational problems. In Section 4 the data transfer lemma
will be given as well as some applications of this lemma to different
models of computation for lower time bound determination.
Our concluding remarks are given at the end of the paper.


The standard SIMD models as described in Section 1 constitute
the framework of a parallel simulation system (PARSIS)
presently under implementation; cf. LEGENDI [7] for a similar
project for simulation of cellular processors.


1. OFF-NETs and ON-NETs


In our experience in parallel program design, setting aside
given technical restrictions, e.g., on $N_{PE}$, $N_{IN}$, etc., in
the first steps of problem solutions enables us to find important
methods of parallelization of solution processes as well as general
features for system description. Of course, for concrete
implementation quite a lot of time must be spent in taking given
restrictions for $N_{PE}$, $N_{IN}$, etc. into consideration. The
present paper is concerned with the first phase, the theoretical
preparation for the second phase, which is the concrete implementation.
In this sense, we shall deal with abstract SIMD models throughout
this paper. More detailed discussion will be the subject of forthcoming
papers, depending on the progress of the PARSIS project.

The common one-accumulator computer, e.g., the random access
machine (RAM) in the sense of AHO et al. [2], may be considered
as the simplest example of an abstract SIMD system - $N_{PE} = 0$ and
$D_{CPU} = W_{CPU} = \infty$. We shall use the RAM as underlying model
for serial data processing where, in distinction to [2], infinite
precision, real number arithmetic is assumed, which is convenient
for our theoretical considerations of computational problems
such as the Fourier transform, or for operations on infinite sets
of points in the real plane, by avoiding discussions of round-off
errors. In this sense, our standardized system description features
start with the declaration of abstract registers.

Abstract registers. For an SIMD system with abstract registers
we assume that any register may store one real number at a
time, without any special encoding tricks. For our theoretical
considerations in this paper, it is not important to specify
how the reals are stored in these abstract registers by special
bit representations.

Standard register enumeration. We assume a unique
enumeration of all registers as follows. For registers $r_m$ of
the PE with index j or (j,k), called PE(j) or PE(j,k) in the
sequel, we use the integer tuples (j,m) or (j,k,m), respectively,
and for register $r_m$ of the CPU just the integer m.

Uniform network structure. Either $N_{IN} = 0$, or $N_{IN} = p \ge 1$
and the network structure is characterized by p different functions
$f_0, f_1, \ldots, f_{p-1}$ on the set ind of all PE indices in the
following way. For $j, k \in ind$, PE(j) and PE(k) are directly connected
iff there exists an i, $0 \le i \le p-1$, such that $f_i(j) = k$.
Because of our assumption that all connections are two-way it follows
that

\[ (\forall j,k \in ind)\; [\, (\exists i \in \{0,1,\ldots,p-1\})\; f_i(j) = k \iff (\exists h \in \{0,1,\ldots,p-1\})\; f_h(k) = j \,]. \]

In [10] the functions $f_0, f_1, \ldots, f_{p-1}$ were called
interconnection functions. With the exception of a fixed set of PEs at
the network border, we also require that all PEs are directly connected
to exactly p different PEs. When $f_i(j) = k$, PE(k) is called the ith
neighbor of PE(j). In this way, the edge coding scheme for uniform
networks is defined. For each PE, the neighborhood consists
of all (i.e., at most p) neighbor PEs. Examples of infinite networks
as well as finite networks matching our uniformity demand
are given in Table 1. In the sequel we shall use these networks
as defined here.
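The two-way condition on the interconnection functions can be checked mechanically for a finite index set. The following is a minimal sketch, not part of the paper's formalism: the helper `is_two_way` and the three example functions on 3-bit indices are illustrative names chosen here, and targets outside the index set are simply ignored as a stand-in for the network border. It also illustrates why a shuffle function needs its inverse added before the network counts as uniform in the above sense.

```python
def is_two_way(ind, functions):
    """Return True iff every link j -> f_i(j) has a reverse link f_h(k) = j."""
    def neighbors(j):
        # Targets outside ind are ignored (assumed border handling).
        return {f(j) for f in functions if f(j) in ind}
    return all(j in neighbors(k) for j in ind for k in neighbors(j))

# Illustrative interconnection functions on 3-bit indices (m = 3).
M_BITS = 3
MASK = (1 << M_BITS) - 1
IND = set(range(1 << M_BITS))

def shuffle(j):
    # a2 a1 a0 -> a1 a0 a2 (cyclic left rotation of the index bits)
    return ((j << 1) & MASK) | (j >> (M_BITS - 1))

def unshuffle(j):
    # inverse of shuffle: a2 a1 a0 -> a0 a2 a1
    return (j >> 1) | ((j & 1) << (M_BITS - 1))

def exchange(j):
    # complement the lowest index bit
    return j ^ 1

one_way = is_two_way(IND, [shuffle, exchange])
two_way = is_two_way(IND, [shuffle, unshuffle, exchange])
```

With the shuffle alone the check fails, because PE(1) reaches PE(2) but not vice versa; adding the inverse shuffle repairs this.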


Some remarks are necessary regarding Table 1. The left-right
$2^i$ (LR2I) network and the left-right-up-down $2^i$ (LRUD2I)
network were used for vector machines in PRATT et al.
[8] and KLETTE et al. [6], respectively, without the restriction
by an integer m as stated in Table 1. Note that we have restricted
ourselves to interconnection networks with finite branching degree. The
special form of the set ind in the Quadtree network is determined
by our standard PE address masking scheme as defined later on.

The finite uniform networks mentioned in Table 1 were studied by
SIEGEL [10] - the perfect shuffle (PS), the ILLIAC, the Cube,
the plus-minus $2^i$ (PM2I), and the wrap-around plus-minus $2^i$
(WPM2I) network, with the modification that the PS network is
an undirected graph to match our uniform network convention,
i.e., for the PS network the inverse shuffle function was
added in comparison to [10]. For $j \in ind = \{0, 1, \ldots, 2^m - 1\}$
let $a_{m-1} \ldots a_1 a_0$ denote the binary representation of j and
let $\bar{a}_i$ denote the complement of $a_i$. Then

\[ \mathrm{Cube}_i(a_{m-1} \ldots a_{i+1} a_i a_{i-1} \ldots a_0) = a_{m-1} \ldots a_{i+1} \bar{a}_i a_{i-1} \ldots a_0, \]

\[ \mathrm{WPM}_{+i}(a_{m-1} \ldots a_i \ldots a_0) = b_{m-1} \ldots b_i \ldots b_0, \]

where $b_{i-1} \ldots b_0 b_{m-1} \ldots b_{i+1} b_i = (a_{i-1} \ldots a_0 a_{m-1} \ldots a_{i+1} a_i) + 1 \bmod 2^m$,

\[ \mathrm{WPM}_{-i}(a_{m-1} \ldots a_i \ldots a_0) = b_{m-1} \ldots b_i \ldots b_0, \]

where $b_{i-1} \ldots b_0 b_{m-1} \ldots b_{i+1} b_i = (a_{i-1} \ldots a_0 a_{m-1} \ldots a_{i+1} a_i) - 1 \bmod 2^m$,
for $0 \le i < m$ and $m \ge 1$.
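As a small illustration of bit-level interconnection functions of this kind, the sketch below implements a cube function (complement bit i) and wrap-around plus-minus functions (rotate the index so that bit i becomes the lowest bit, add or subtract 1 modulo 2^m, rotate back). The function names and the argument order are choices made here for the example, not notation from the paper or from [10].

```python
def cube(i, j, m):
    """Complement bit i of the m-bit index j."""
    return j ^ (1 << i)

def _rotate_right(j, i, m):
    # Move bit a_i to the lowest position; the low i bits wrap to the top.
    mask = (1 << m) - 1
    return ((j >> i) | (j << (m - i))) & mask

def _rotate_left(j, i, m):
    # Inverse of _rotate_right.
    mask = (1 << m) - 1
    return ((j << i) | (j >> (m - i))) & mask

def wpm(i, j, m, sign=+1):
    """Wrap-around plus-minus 2^i: sign=+1 gives WPM_{+i}, sign=-1 gives WPM_{-i}."""
    rotated = _rotate_right(j, i, m)
    stepped = (rotated + sign) % (1 << m)
    return _rotate_left(stepped, i, m)
```

For m = 3 and i = 1, index 6 (binary 110) is mapped by `wpm(1, 6, 3)` to 1 (binary 001): the carry out of the top bit wraps around into bit 0, whereas plain addition of 2 modulo 8 would give 0.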

Standard PE masking scheme. As standard masks we shall use
the simple bit patterns for PE indices as used, for example, in
[10]. In the case of integer indices, a standard PE address mask
is given by an arbitrary, non-empty word on the alphabet {0,1,x}
enclosed by brackets, where x represents the "don't care" situation.
The only PEs that will be active are those whose address
(i.e., index) matches the mask from right to left, where the
indices are given in binary representation; 0 matches 0, 1
matches 1, and either 0 or 1 matches x. For example, by mask
[x] all PEs are activated. For the representation of concrete
standard masks within programs, etc. we take liberties such as
[all PEs] instead of [x], or [odd PEs] instead of [1x] if the
rightmost bit position is assumed to be the sign position. In
the case of integer tuple indices, the standard PE address masks
are arbitrary tuples of non-empty words on {0,1,x} enclosed by
brackets. Note that for infinite networks as given in Table 1
any given PE address mask activates an infinite set of PEs.
For example, the mask [0xx] applied to the bintree network will
activate the processing elements PE(2) and PE(3) on layer 1 of
the bintree, disable layer 2, enable the first four PEs of
layer 3, and so on, where the common binary representation of
non-negative integers is assumed for the PE indices of the bintree
network.
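The right-to-left matching rule can be sketched as follows. This is an illustration only: the mask is passed as a plain string without brackets, index bits to the left of the mask are ignored, and indices shorter than the mask are treated as zero-padded on the left, which is an assumption the text does not spell out.

```python
def matches(mask, index):
    """Match a mask word over {0, 1, x} against the low-order bits of index."""
    for position, symbol in enumerate(reversed(mask)):
        bit = (index >> position) & 1
        if symbol != "x" and int(symbol) != bit:
            return False
    return True

def active_pes(mask, indices):
    """Return the indices enabled by the mask, in their given order."""
    return [j for j in indices if matches(mask, j)]

# Bintree layers 1 to 3 hold the indices 2..15.
enabled = active_pes("0xx", range(2, 16))
```

For the mask [0xx] this enables indices 2 and 3 on layer 1, nothing on layer 2 (indices 4 to 7), and indices 8 to 11 on layer 3, in line with the bintree example.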

Abstract CPU instruction set. For any one of our theoretical
SIMD systems, we shall assume that its CPU instruction
set may be obtained by special interpretation and selection of
the instructions of an abstract CPU instruction reservoir defined
as follows. There are two different types of instructions:
parallel instructions for activating some of the PEs, and serial
instructions where the CPU itself is addressed for certain
operations. A parallel instruction consists of a PE address
mask, an operation code (READ, WRITE, LOAD, STORE, $OP_1$, or
$OP_{\ell+1}$), and an operation address a, where we shall use
the standard register enumeration for explaining the meaning
of these operation addresses. For the serial instructions,
we assume branching instructions JUMP b, JGTZ b, JZERO
b, JLTZ b (where b symbolizes an instruction number in a CPU
program and the contents of the CPU accumulator are tested),
the HALT instruction, and instructions consisting of an operation
code (READ, WRITE, LOAD, STORE, $OP_1$, or $OP_2$). See Table
2 for the complete abstract CPU instruction set without jump
and stop instructions. In the case of a parallel instruction,

$OP_1$ denotes a unary operation determining the new accumulator
contents of all activated PEs by a certain transformation of the
contents of the register addressed by a as well as the old accumulator
contents of the activated PEs; and $OP_{\ell+1}$ denotes an
$(\ell+1)$-ary operation in the same sense. For the activated PE(j) the
operation address m indicates the contents of register (j,m), *m
indicates the contents of register (j,n) if the nonnegative integer
n is the contents of register (j,m) at that moment (i.e., indirect
operand addressing; in any situation of incorrect programming,
e.g., in the case that (j,m) does not have a nonnegative
integer contents at that moment, an interrupt of the programmed
system is assumed), and the operand $: i_1, i_2, \ldots, i_\ell$ for
$\ell \ge 1$ indicates the contents of the accumulators of those
neighbors of the activated PEs that are encoded by
$i_1, i_2, \ldots, i_\ell$ according to
the edge coding scheme of the interconnection network. LOAD
and STORE have the obvious meanings that the accumulator contents
of the activated PEs are replaced by the addressed
value, or copied to the addressed registers, respectively.

READ and WRITE denote the necessary operations for communication
with the outside world where the source and the destination
of the data in the "outside world" remain unspecified
(certain places within a computing environment not belonging
to the given SIMD system itself). In the case of a serial
instruction, the unary operation $OP_1$ and the binary operation
$OP_2$ produce new CPU accumulator contents by a certain
transformation of the addressed values, where in the case of $OP_2$ the
old CPU accumulator contents is used as the operand in the first
position. READ, WRITE, LOAD, and STORE have the obvious fixed
meanings. The operands =x, m, *m, and (j) indicate the data unit
x itself, the contents of CPU register m, the contents of CPU
register n if register m contains the nonnegative number n at
that moment, and the contents of register (j,0), respectively.
Note that with this abstract CPU instruction set data transfer
between the CPU and the PEs is possible via the accumulators
in serial mode only. Furthermore, for a specialized SIMD model,
it is convenient to identify the basic computational power of
the PEs and the CPU with that of the RAM as represented by the
RAM instruction set [2, Fig. 1.5], roughly speaking. In this
way, an interesting point is provided by the description of how
the PEs are able to perform local logical decisions in SIMD
mode as we shall explain in Example 1 by equation (1) for a special
SIMD model.
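The four serial operand forms =x, m, *m, and (j) can be illustrated by a small resolver. This is a sketch under assumptions made here: operands are encoded as tagged tuples, the CPU registers and PE accumulators are dictionaries whose missing entries read as zero, and the interrupt on incorrect indirect addressing is modelled by a hypothetical `ProgramInterrupt` exception.

```python
class ProgramInterrupt(Exception):
    """Stands in for the interrupt assumed on incorrect programming."""

def resolve_cpu_operand(operand, cpu_regs, pe_accumulators):
    """Return the value denoted by a serial operand.

    operand is one of ("lit", x), ("reg", m), ("ind", m), ("pe", j),
    corresponding to =x, m, *m, and (j).
    """
    kind, value = operand
    if kind == "lit":          # =x : the data unit x itself
        return value
    if kind == "reg":          # m : contents of CPU register m
        return cpu_regs.get(value, 0)
    if kind == "ind":          # *m : contents of register n, n = contents of m
        n = cpu_regs.get(value, 0)
        if not (isinstance(n, int) and n >= 0):
            raise ProgramInterrupt(f"register {value} holds {n!r}")
        return cpu_regs.get(n, 0)
    if kind == "pe":           # (j) : accumulator of PE(j), i.e. register (j,0)
        return pe_accumulators.get(value, 0)
    raise ValueError(f"unknown operand kind: {kind}")

cpu = {0: 5, 1: 3, 3: 42, 4: 2.5}
accs = {2: 9}
```

Here `("ind", 1)` yields 42 because register 1 holds the address 3, while `("ind", 4)` raises the interrupt because register 4 does not hold a nonnegative integer.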


Off-line  I/O  convention.  For  the  off-line  communication  of 
an  SIMD  system  with  the  outside  world  we  assume  that  a  special 
set  of  input  registers  of  the  system  is  fixed  such  that  all  other 
registers  of  the  system  contain  value  zero  at  the  beginning  of 
any  computation  (moment  t=0)  as  it  is  assumed  for  those  input 
registers  not  actually  needed  for  the  placement  of  input  data. 

Each  of  the  input  registers  may  contain  at  most  one  data  unit 
of  the  input  data.  Thus,  for  concrete  problem  solutions,  it 
is  necessary  to  specify 

•  what  data  structure  is  assumed  for  the  given  input  data, 
and 

•  how  the  data  are  placed  in  the  given  input  register  set. 
Also, a set of output registers of the system must be fixed. In
this sense, for concrete problem solutions it has to be clear

•  what is the desired data structure for the output data, and

•  how this data structure has to be stored, or computed, in
the predetermined output register set.

As off-line I/O convention we declare that for a certain L,
$1 \le L \le D_{CPU}$, the CPU registers 0, 1, ..., L-1 are fixed to be
input and output registers, and for any PE(j), if there exists a certain
$m \ge 0$ such that register (j,m) is fixed to be an input register
(output register) then register (j,0) is an input register (output
register) as well. What is true for the register holds for the
accumulator, too.

On-line I/O convention. For the on-line communication of
an SIMD system with the outside world some registers are predetermined
to act as input and/or output registers. As on-line I/O
convention we adopt the same rules as in the off-line case.
But, at the beginning of any on-line computation (moment t=0),
all registers of the system are assumed to hold value zero.
Input data or output data may enter or leave the system at a
moment as specified by the CPU program according to READ or
WRITE instructions. In any correct program these input (output)
instructions have to be addressed to a proper subset of all
registers specified as input (output) registers. For the input
(output) data it is assumed that there exists a memory facility
in the outside world from where (to where) the input (output)
data are obtained (given) by the system. Thus, for concrete
problem solutions it is necessary to specify

•  what data structures are assumed for the input and output
data, and

•  how  these  data  are  partitioned  into  waves  of  information 
such  that  one  wave  may  enter  (leave)  the  system  per  input 
(output)  operation  as  performed  according  to  the  CPU  program. 

The  size  of  these  waves  of  information,  i.e.,  the  number  of  data 
units  forming  those  waves,  may  alter  during  a  computation  process, 
and  just  one  data  unit,  for  example  by  LOAD  =  x,  will  be  considered 
to  be  the  simplest  case  of  a  wave  of  information. 

Uniform  cost  criterion.  For  measuring  the  time  complexity 
of  computations,  we  assume  that  any  (basic)  instruction  of  the 
SIMD  system  needs  one  unit  of  time  for  performance  on  this  system. 

Definition 1. A model of computation SYS is called a standard
off-line network system (SYS $\in$ OFF-NET) iff SYS is defined by

•  a CPU and a fixed set of indexed PEs, with concrete
values for $D_{CPU}$ and $D_{PE}$,

•  abstract registers if not otherwise specified, and the
standard register enumeration,

•  a uniform interconnection network with $0 \le N_{IN} < \infty$,

•  the standard PE masking scheme,

•  a special interpretation and selection of instructions
of the abstract CPU instruction set where

(OFF.1)  no READ and WRITE instructions are contained in
the instruction set of SYS,

(OFF.2)  for the CPU all RAM instructions [2, Fig. 1.5]
except READ and WRITE are available,

(OFF.3)  for $N_{IN} = p \ge 1$ at least one instruction of the type
[all PEs] $OP_{p+1}$ : 0, 1, ..., p-1 is available, and

(OFF.4)  for any output register (j,0), i.e., accumulator
of PE(j), at least one instruction of the type
$OP_2$ (j) is available, i.e., the CPU may have control
of any outputting PE,

•  the off-line I/O convention, and

•  the uniform cost criterion.

For the defined class OFF-NET we may define subclasses -
e.g., OFF-NET$_p$ to be the set of all SYS $\in$ OFF-NET having the
branching degree $p = N_{IN}$, OFF-SQUARE to be the set of all
SYS $\in$ OFF-NET having a square network as defined in Table 1,
OFF-BINTREE with the same reference to Table 1,
OFF-PS $= \bigcup_{m=1}^{\infty}$ OFF-PS$_m$, or just OFF-RAM.


Example 1. Let us consider the following special SIMD
system EXAMP1 $\in$ OFF-SQUARE. Let $D_{CPU} = D_{PE} = \infty$. In
addition to the CPU registers 0, 1, ..., L-1 for a certain $L \ge 1$,
all the accumulators (j,k,0), $0 \le j < M$ and $0 \le k < N$ for some
$M, N \ge 1$, are fixed as input and output registers of EXAMP1. The
system possesses the following instruction set:

[mask] ADD a,  a for m, *m, : $i_1, \ldots, i_\ell$ for $i_1, \ldots, i_\ell \in \{0,1,2,3\}$,

[mask] $OP_\ell$ a,  a for m, *m, : i for $i \in \{0,1,2,3\}$, $\ell = 1, 2$,

[mask] LOAD a,  a for m, *m, : i for $i \in \{0,1,2,3\}$,

[mask] STORE a,  a for m, *m, : $i_1, \ldots, i_\ell$ for $i_1, \ldots, i_\ell \in \{0,1,2,3\}$,

LOAD a,  a for =x, m, *m, (j,k),

STORE a,  a for m, *m, (j,k),

$OP_\ell$ a,  a for =x, m, *m, (j,k),

JUMP b, JGTZ b, JZERO b, JLTZ b, and HALT.

Here, [mask] represents an arbitrary PE address mask, $OP_1$ is
ABS (absolute value) or SIGN (signum function), $OP_2$ is ADD,
SUB, MULT, or DIV, and the tuples (j,k) satisfy $0 \le j < M$ and
$0 \le k < N$.


To give a short illustration of the computing power of
EXAMP1 let us consider the computation of the parallel Roberts
gradient (cf. [9] for its importance to digital image processing),
where the input image $A = (a_{jk})$ of size $M \times N$ is assumed
to be stored in the PE input registers ($a_{jk}$ in register (j,k,0))
at the beginning of the computation. At the end of the computation,
value $\max\{|a_{jk} - a_{j+1,k+1}|, |a_{j+1,k} - a_{j,k+1}|\}$ has to
be present in register (j,k,0).


By performing the following sequence of parallel instructions,

 1. [all PEs] STORE 1         7. [all PEs] STORE 3
 2. [all PEs] LOAD :2         8. [all PEs] LOAD 1
 3. [all PEs] STORE 2         9. [all PEs] LOAD :1
 4. [all PEs] LOAD :1        10. [all PEs] SUB 2
 5. [all PEs] SUB 1          11. [all PEs] ABS 0
 6. [all PEs] ABS 0          12. [all PEs] STORE 4

all registers (j,k,3) contain value $|a_{jk} - a_{j+1,k+1}|$, and all
registers (j,k,4) contain value $|a_{j+1,k} - a_{j,k+1}|$, for
$0 \le j < M$ and $0 \le k < N$. These values may be considered as two
$M \times N$ matrices B and C. For
$\max(B,C) = (\max\{b_{jk}, c_{jk}\})$ we have

\[ \max(B,C) = B \times \mathrm{sign}(B-C) + C \times \mathrm{sign}(C-B) + B - B \times \mathrm{sign}|B-C|, \qquad (1) \]


where $\times$ means the parallel MULT operation (elementwise product
of two matrices), and sign the parallel SIGN operation. Using this
formula, the parallel Roberts gradient may be computed on the
defined special OFF-SQUARE system within time 29 or less, independent
of the values of M and N, as the reader may check easily.
Note that formula (1) describes a way in which the PEs are able
to perform local logical decisions in SIMD mode.

Example 2. By some easily described modifications, the system
EXAMP1 may be altered dramatically. Replace the square network
by LRUD2I$_m$, for $m < \max\{\log_2 M, \log_2 N\}$, let
$W_{PE} = 1$, and replace the parallel operations ADD, $OP_1$ and
$OP_2$ by logical operations AND, NOT, and OR, respectively. What
results is a special OFF-LRUD2I$_m$ system EXAMP2 which essentially
coincides with the PBS (paralleles Binärbildverarbeitungssystem,
i.e., parallel binary image processing system). The computational
power of the PBS was extensively studied in [4].


Definition 2. A model of computation SYS is called a
standard on-line network system (SYS $\in$ ON-NET) iff SYS is defined
by

•  a CPU and a fixed set of indexed PEs, with concrete
values for $D_{CPU}$ and $D_{PE}$,

•  abstract registers if not otherwise specified, and the
standard register enumeration,

•  a uniform interconnection network with $0 \le N_{IN} < \infty$,

•  the standard PE masking scheme,

•  a special interpretation and selection of instructions
of the abstract CPU instruction set where, for $N_{IN} \ge 2$,
an integer tuple (p,q) may be denoted to be the characteristic
of SYS in the following sense:

(ON.1)  $p = N_{IN}$ and

(ON.2)  a proper subset $\{i_1, i_2, \ldots, i_q\}$ of all directions
{0, 1, ..., p-1} is specified,

(ON.3)  at least one instruction of the type
[all PEs] $OP_{q+1}$ : $i_1, i_2, \ldots, i_q$
is available,

(ON.4)  for any of the instructions [mask] LOAD : j or
[mask] $OP_{k+1}$ : $j_1, j_2, \ldots, j_k$, $k \ge 1$, it follows
that $j, j_1, j_2, \ldots, j_k \in \{i_1, i_2, \ldots, i_q\}$,

(ON.5)  for any of the instructions [mask] STORE : $j_1, j_2,
\ldots, j_k$, $k \ge 1$, it follows that $j_1, j_2, \ldots, j_k \in
\{0, 1, \ldots, p-1\} - \{i_1, i_2, \ldots, i_q\}$, i.e., the results of
consecutive parallel operations may be shifted through
the system in directions
$\{0, 1, \ldots, p-1\} - \{i_1, i_2, \ldots, i_q\}$
only, and, furthermore

(ON.6)  for the CPU all RAM instructions are available
including READ and WRITE,

(ON.7)  for any output register (j,0), at least one
instruction of the type $OP_2$ (j) is available,

•  the on-line I/O convention, and

•  the uniform cost criterion.
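The direction constraints of an on-line system, where values are read from one proper subset of the directions and stored only into the remaining ones, can be expressed as a simple admissibility check. The function name and argument layout below are illustrative choices; the sample values p = 3 with read directions {1, 2} are picked to resemble a binary tree in which data move from the two children (directions 1 and 2) towards the parent (direction 0).

```python
def direction_use_is_admissible(opcode, directions, p, read_directions):
    """Check the neighbor operands of one parallel instruction.

    opcode          -- "LOAD", "OP" or "STORE"
    directions      -- the neighbor codes appearing after ':' in the operand
    p               -- branching degree of the network
    read_directions -- the specified proper subset of {0, ..., p-1}
    """
    all_directions = set(range(p))
    read_directions = set(read_directions)
    if not read_directions < all_directions:
        raise ValueError("read directions must form a proper subset")
    used = set(directions)
    if opcode == "STORE":
        # Results may only be shifted into the remaining directions.
        return used <= all_directions - read_directions
    # LOAD and OP may only read from the specified directions.
    return used <= read_directions
```

With these sample values, `LOAD :1` and `STORE :0` are admissible, whereas `STORE :1` and `LOAD :0` are not, so data can only flow one way through the network.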

For the defined class ON-NET we may define subclasses -
e.g., ON-NET$_{p,q}$ to be the set of all ON-NET systems with
characteristic (p,q), ON-LR2I$_m$ to be the set of all
SYS $\in$ ON-NET having a left-right $2^i$ network as defined in
Table 1, ON-ILLIAC$_m$ with the same reference to Table 1,
ON-PM2I $= \bigcup_{m=1}^{\infty}$ ON-PM2I$_m$, or just ON-RAM.

Any infinite network class, such as OFF-LINEAR or ON-DIAGONAL, may
be considered as an abstraction of a finite network system,
or as the union of classes of finite network systems in the
following way.

Definition 3. Let OFF-IN be the set of all OFF-NET systems
which are defined by a special infinite network IN, e.g., IN =
LINEAR or IN = LRUD2I$_m$. A model of computation SYS is called a
finite OFF-IN system (SYS $\in$ FIN-OFF-IN) iff there exists a system
SYS$_0$ $\in$ OFF-IN such that SYS may be obtained as a restriction of
SYS$_0$ in the following sense:

Let $ind_0$ and $D_{PE}^{0}$ be the PE index set and the PE memory depth
for SYS$_0$, respectively. A finite cut-off of the PE register set
of SYS$_0$ is defined by a certain finite subset ind of $ind_0$ and a
(possibly infinite) memory depth $D_{PE} \le D_{PE}^{0}$. The operation
of SYS may be described as follows. All registers in a certain finite
cut-off of SYS$_0$ are available in SYS but all registers not in this
finite cut-off will be considered to be dummy registers, i.e.,
they are assumed to store value zero if addressed as an operand,
and to "forget" any value handed over to them; this is the
only difference between SYS$_0$ and SYS.
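The dummy-register behavior of a finite cut-off (reads outside the cut-off return zero, writes outside it are forgotten) can be sketched with a small register file. The class and method names are chosen here for illustration, and integer PE indices are assumed for simplicity.

```python
class CutOffRegisters:
    """PE registers of a finite cut-off; everything else is a dummy register."""

    def __init__(self, ind, depth):
        self.ind = set(ind)      # finite set of available PE indices
        self.depth = depth       # memory depth, may be float("inf")
        self._store = {}         # (j, m) -> value; unwritten registers hold 0

    def _available(self, j, m):
        return j in self.ind and 0 <= m < self.depth

    def read(self, j, m):
        # Dummy registers always deliver zero.
        if not self._available(j, m):
            return 0
        return self._store.get((j, m), 0)

    def write(self, j, m, value):
        # Values handed to dummy registers are forgotten.
        if self._available(j, m):
            self._store[(j, m)] = value

registers = CutOffRegisters(ind=range(1, 8), depth=3)
registers.write(1, 0, 5)   # inside the cut-off: kept
registers.write(9, 0, 7)   # PE index outside the cut-off: forgotten
registers.write(1, 3, 2)   # register beyond the memory depth: forgotten
```

A program written for the infinite system therefore runs unchanged on the cut-off; it merely sees zeros wherever it reaches past the available registers.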

Analogously  the  set  FIN-ON-IN  may  be  defined. 

Example 3.  An example of a FIN-ON-BINTREE system may be
specified as follows.  Let $D_{CPU} = \infty$ and $D_{PE} = m > 2$.  The
cut-off of the bintree network is given by ind $= \{1,2,\ldots,2^m-1\}$.
In addition to the CPU accumulator, which acts as an input
and output register (L=1), the registers $(2^{m-1},0)$, $(2^{m-1}+1,0)$,
..., $(2^m-1,0)$, i.e., the accumulators of the $2^{m-1}$ leaf node
PEs, are fixed as input registers, and register (1,0), i.e.,
the accumulator of the top node PE, is fixed as an output
register.  The system possesses the following instruction set:

    [mask]  ADD a,      a for m, *m, :1, :2, :1,2,
    [mask]  OP_i a,     a for m, *m, :1, :2 and i=1,2,
    [mask]  LOAD a,     a for m, *m, :1, :2,
    [mask]  STORE a,    a for m, *m, :0,
    [subset leaf nodes]  READ 0,
    [top node]  WRITE 0,
    LOAD a,             a for =x, m, *m, (1),
    STORE a,            a for m, *m, (1),
    OP_i a,             a for =x, m, *m, (1), and i=1,2,
    READ 0,
    WRITE a,            a for =x, 0,
    JUMP b,  JGTZ b,  JZERO b,  JLTZ b,  HALT.

Here, [mask] represents an arbitrary PE address, OP_1 either
ABS or SIGN, and OP_2 one of the operation codes ADD, SUB, MULT, or
DIV.  Altogether, a FIN-ON-BINTREE system EXAMP3 is defined
which may be obtained by a restriction of an infinite ON-BINTREE
model where infinite sets of input and output PE registers are
available in the infinite original system.

To give a short illustration of the computational power of
the system EXAMP3 let us consider the computation of the
arithmetical average $\frac{1}{N}\sum_{i=0}^{N-1} a_i$, $N = 2^{n-1}$ and n odd, for M
consecutive waves of information $(a_0, a_1, \ldots, a_{N-1})$ where $a_i$ is
fed to the accumulator of the PE($2^{n-1}+i$), for i=0,1,...,N-1.
In the order of the M consecutive waves of information, the
arithmetical averages have to leave the system via register (1,0).

For initialization of the system, at first the instructions
LOAD =N, STORE (1), [top node] STORE 1 will be performed in this
order.  For M > (n-1)/2 the following sequence of instructions is
executed (n-1)/2 times:

    [leaf nodes]  READ 0,
    [all PEs]     ADD :1,2,
    [leaf nodes]  LOAD 1,
    [all PEs]     ADD :1,2,

followed by the following sequence of instructions which is
executed M-(n-1)/2 times:

    [top node]    DIV 1,
    [top node]    WRITE 0,
    [leaf nodes]  READ 0,
    [all PEs]     ADD :1,2,
    [leaf nodes]  LOAD 1,
    [all PEs]     ADD :1,2.

Finally, the following sequence of instructions is executed
(n-3)/2 times:

    [top node]    DIV 1,
    [top node]    WRITE 0,
    [all PEs]     ADD :1,2,
    [all PEs]     ADD :1,2,

followed by the last two instructions [top node] DIV 1 and
[top node] WRITE 0.  Thus, altogether, the arithmetic averages
of M > (n-1)/2 consecutive waves of information $(a_0, a_1, \ldots, a_{N-1})$
may be computed within 6M+n basic operations of EXAMP3, instead
of $O(N \cdot M)$ basic operations in the serial case using a RAM
as model for computation.
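The operation count 6M+n can be checked by writing the schedule out explicitly. The following is a minimal sketch in Python; the function name and the representation of instructions as strings are assumptions made for illustration only, and the sketch counts instructions without executing them.

```python
def examp3_operation_count(n: int, M: int) -> int:
    """Count the basic operations of the averaging schedule described above.

    n is the (odd) number of tree levels, M the number of input waves.
    Hypothetical helper: instructions are plain strings and are only counted.
    """
    if n < 3 or n % 2 == 0 or M < (n - 1) // 2:
        raise ValueError("expects odd n >= 3 and M >= (n-1)/2")
    init = ["LOAD =N", "STORE (1)", "[top] STORE 1"]
    fill = ["READ 0", "ADD :1,2", "LOAD 1", "ADD :1,2"]
    steady = ["DIV 1", "WRITE 0"] + fill
    drain = ["DIV 1", "WRITE 0", "ADD :1,2", "ADD :1,2"]
    schedule = list(init)
    schedule += fill * ((n - 1) // 2)          # pipeline is filled
    schedule += steady * (M - (n - 1) // 2)    # one result per round
    schedule += drain * ((n - 3) // 2)         # remaining waves drain
    schedule += ["DIV 1", "WRITE 0"]           # last average leaves
    return len(schedule)


if __name__ == "__main__":
    for n in (3, 5, 7):
        for M in range((n - 1) // 2 + 1, 10):
            print(n, M, examp3_operation_count(n, M), 6 * M + n)
```

Summing the four phases gives 3 + 2(n-1) + 6(M-(n-1)/2) + 2(n-3) + 2 = 6M+n, in agreement with the count stated in the text.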

In  conclusion,  we  point  out  that  SIMD  now  denotes  not  a 
general  concept  (single-instruction,  multiple  data)  but  an  exactly 
defined  class  of  models  for  computation,  namely  the  union  of  all 
system  classes  given  by  Definitions  1,  2,  and  3. 


2.  Local, global, and total data flow measures


Let SYS $\in$ SIMD; throughout this paper such a special parallel
processing system will be used as a standard system for
considerations of data transfer restrictions in computing systems.
Any computational process performed on such a model SYS may
be uniquely specified by a CPU program $\pi$ and a concrete input
situation I characterized by the placement of input values into
the set of input registers if off-line mode is used, or by
the partition of the input data into consecutive waves of
information fed to some of the input registers of the system from the
outside world if on-line mode is used.

As suggested by applications to visual perception, the set
of input registers of the model SYS may be considered as the
retina of the system, and any new wave of information to this
set of input registers represents a snapshot of the outside
world.  In this sense, after t steps of a computational process
characterized by a program $\pi$ and an input situation I, for any
register r of the system we may mark out a certain receptive
field $rec_{\pi,I}(r,t)$ containing all the names of those input registers
which have had any influence on the contents of register r up
to the moment t, where new waves of information to the retina of
the system create new names of the input registers, formally
represented by $r^{(0)}, r^{(1)}, r^{(2)}, \ldots, r^{(WN)}, \ldots$ for register r.

Standard register names.  At time t=0 of any computational
process, each register r in our standard enumeration possesses
the name $r^{(0)}$.  At t=0 let the wave number WN=0 also.  At time
t+1 assume that a serial or parallel READ instruction, or an
instruction LOAD =x, OP_1 =x, or OP_2 =x has to be performed.  Then,
by this operation we obtain WN $\leftarrow$ WN+1 and the new names $r^{(WN)}$ for
all registers which were addressed by these instructions.  For
example, the name $(j,c(j,m))^{(WN)}$ in the case of an instruction
[mask] READ *m for all activated processing elements PE(j),
where c(j,m) denotes the actual contents of register (j,m), or
the name $0^{(WN)}$ in the case of an instruction OP_2 =x.

Definition 4.  Let SYS $\in$ SIMD.  Standard register names are
assumed.  For a program $\pi$ of SYS, an input situation I of SYS,
a register r of SYS, and an arbitrary moment $t \ge 0$, the receptive
field $rec_{\pi,I}(r,t)$ is recursively defined as follows:

moment t=0:

$$rec_{\pi,I}(r,0) = \begin{cases} \{r^{(0)}\} & \text{if input register } r \text{ stores an input value according to } I, \text{ for off-line mode,} \\ \emptyset & \text{otherwise;} \end{cases}$$

moment t+1, $t \ge 0$:

At moment t+1 a certain instruction has to be applied according
to $\pi$ and I, or the HALT instruction is assumed for this moment.

(i)  Depending on this instruction, if it is one of those listed
in Table 3, the changes of receptive fields are defined as given
in this Table where we omit the indices $\pi$ and I for simplification
of the expressions.  In the case of parallel instructions, the
mentioned changes are valid for all activated PEs PE(j) where
j matches [mask].

(ii)  For the parallel or serial LOAD instructions the changes
of receptive fields are the same as for the corresponding OP_1
instructions.

(iii)  In the case of a WRITE, JUMP, or HALT instruction no
changes of receptive fields appear.

(iv)  In the case of a JGTZ, JZERO, or JLTZ instruction no
changes of receptive fields appear in step t+1, but the
set rec(0,t) will be added at moment $t' \ge t+2$ to any receptive
field that alters at moment t' according to (i) or (ii), if
at moment t' an instruction has to be performed covered by
cases (i) and (ii).  For example, the instruction [mask] OP_2
m, at moment $t' \ge t+2$, will produce the changes
$rec((j,0),t') = rec((j,0),t'-1) \cup rec((j,m),t'-1) \cup rec(0,t)$
for all activated PEs.

For illustration of this definition, consider the special
OFF-SQUARE system as defined in Example 1.  Let I be any concrete
input situation for computing the parallel Roberts gradient
and let $\pi$ be the sequence of the 12 parallel instructions
as given there.  At moment t=0 we have $rec((j,k,0),0) = \{(j,k,0)^{(0)}\}$
for $0 \le j < M$ and $0 \le k < N$, and for any other register r of the system
EXAMP1, rec(r,0) is the empty set.  After performing the 12
instructions of $\pi$ the receptive fields of maximal cardinality 2
belong to the registers (j,k,0), (j,k,3) and (j,k,4), for
$0 \le j \le M-2$ and $0 \le k \le N-2$, where, e.g.,
$rec((j,k,0),12) = \{(j+1,k,0)^{(0)}, (j,k+1,0)^{(0)}\}$.  For the system
defined in Example 3, and the program and the input situation as
described there, after performing the 6M+n instructions the
receptive field of maximal cardinality NM+1 belongs to the register
(1,0), i.e., to the accumulator of the top node PE.


Definition 5.  Let SYS $\in$ SIMD.  For a set R of registers of
SYS and a moment $t \ge 0$ define the local data transfer function
$\lambda_{SYS}$ by

$$\lambda_{SYS}(R,t) = \max_{\pi} \max_{I} \max_{r \in R} \mathrm{card}(rec_{\pi,I}(r,t)),$$

the global data transfer function $\gamma_{SYS}$ by

$$\gamma_{SYS}(R,t) = \max_{\pi} \max_{I} \mathrm{card}\Big(\bigcup_{r \in R} rec_{\pi,I}(r,t)\Big),$$

and the total data transfer function $\tau_{SYS}$ by

$$\tau_{SYS}(R,t) = \max_{\pi} \max_{I} \sum_{r \in R} \mathrm{card}(rec_{\pi,I}(r,t)).$$

By this definition, it follows immediately that the functions
$\lambda_{SYS}$, $\gamma_{SYS}$ and $\tau_{SYS}$ are monotonically increasing for any
set R of registers of SYS and increasing values of t.  Furthermore,

$$\lambda_{SYS}(R,t) \le \gamma_{SYS}(R,t) \le \tau_{SYS}(R,t)$$

for all models SYS $\in$ SIMD, sets R of registers and moments $t \ge 0$.
Also note that for any model SYS, if within t steps of an arbitrary
program $\pi$ for SYS starting with an arbitrary input
situation I for SYS at most $\omega_{SYS}(t)$ input data may be fed to
the system, then

$$\gamma_{SYS}(R,t) \le \omega_{SYS}(t), \quad \text{and} \qquad (3.1)$$

$$\tau_{SYS}(R,t) \le \lambda_{SYS}(R,t) \cdot \mathrm{card}(R), \qquad (3.2)$$

for any set R of registers of SYS and $t \ge 0$.
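For one fixed program and input situation, the three quantities inside the maxima of Definition 5 are simple set computations. The sketch below illustrates this in Python; the function name and the dictionary representation of receptive fields are assumptions for illustration. The data transfer functions themselves additionally take the maximum over all programs and input situations, which is not attempted here.

```python
def transfer_measures(rec_fields):
    """Return (local, global, total) for given receptive fields.

    rec_fields maps each register of R to the set of input-register names
    in its receptive field at some fixed moment t (hypothetical encoding).
    """
    fields = list(rec_fields.values())
    local = max((len(s) for s in fields), default=0)   # largest single field
    global_ = len(set().union(*fields))                # size of the union
    total = sum(len(s) for s in fields)                # sum of the sizes
    return local, global_, total


if __name__ == "__main__":
    rec = {"r1": {"a", "b"}, "r2": {"b", "c"}, "r3": set()}
    lam, gam, tau = transfer_measures(rec)
    # Chain of inequalities from the text, and the bound of type (3.2).
    print(lam, gam, tau, lam <= gam <= tau <= lam * len(rec))
```

Because the union of the fields is contained in their disjoint sum, and each field is no larger than the largest one, the chain local ≤ global ≤ total ≤ local · card(R) holds for every such input.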

Example 4.  In Section 4 we shall characterize the way
to use these data transfer functions for obtaining lower time
bounds for concrete computational problems.  For serial data
processing we shall apply the system RAM$_L$, cp. [2, Fig. 1.5],
as a model for computation, where $R_L = \{0,1,2,\ldots,L-1\}$, $L \ge 1$,
is assumed to be the set of all input/output registers of
such a machine ($D_{CPU} = \infty$, $N_{PE} = 0$, $W_{CPU} = \infty$).  For $t \ge 0$, we have
$\omega_{\text{OFF-RAM}_L}(t) = L+t$ and $\omega_{\text{ON-RAM}_L}(t) = t$.  For
OFF-RAM $= \bigcup_{L=1}^{\infty}$ OFF-RAM$_L$ note that
$\omega_{\text{OFF-RAM}}(t) = \max_L \omega_{\text{OFF-RAM}_L}(t)$ is not defined.
Furthermore, we have

$$\lambda_{\text{OFF-RAM}_L}(R_L,t) = \begin{cases} 2t+1 & \text{for } 0 \le t \le \lfloor (L-1)/2 \rfloor, \\ \lfloor (L+1)/2 \rfloor + t & \text{otherwise,} \end{cases} \qquad (4.1)$$

$$\gamma_{\text{OFF-RAM}_L}(R_L,t) = L+t, \quad \text{and} \qquad (4.2)$$

$$\tau_{\text{OFF-RAM}_L}(R_L,t) = L(t - \lfloor L/2 \rfloor + 1) \quad \text{for } t > \lfloor L/2 \rfloor, \qquad (4.3)$$

in the case of using the RAM$_L$ in off-line mode, and

$$\lambda_{\text{ON-RAM}_L}(R_L,t) = \gamma_{\text{ON-RAM}_L}(R_L,t) = t, \qquad (4.4)$$

$$\tau_{\text{ON-RAM}_L}(R_L,t) = \begin{cases} t(t+1)/2 & \text{for } t < L, \\ L(t - (L/2) + \tfrac{1}{2}) & \text{for } t \ge L, \end{cases} \qquad (4.5)$$

in the case of using the RAM$_L$ in on-line mode.  The maximal
data flow for obtaining equation (4.1) is possible by indirect
addressing OP_2 *m, followed by OP_2 =x operations.  For (4.3),
the same sequence of operations is extended by L-1 instructions
STORE m.  For (4.4), t operations of the type OP_2 =x may
be considered.  For small t the exact derivation of the function
$\tau_{\text{OFF-RAM}_L}$ represents a sophisticated problem already,
for this quite simple model of serial computation.
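The piecewise formulas of this example can be sanity-checked numerically. The sketch below takes (4.1) to read 2t+1 for 0 ≤ t ≤ ⌊(L-1)/2⌋ and ⌊(L+1)/2⌋+t otherwise, and (4.5) to read t(t+1)/2 for t < L and L(t-L/2+1/2) for t ≥ L; this reading, like the function names, is an assumption of the sketch. It only confirms that the two branches fit together and respect the bound L+t of type (3.1).

```python
def lambda_off_ram(L: int, t: int) -> int:
    """Local data transfer of RAM_L in off-line mode, as read from (4.1)."""
    if t <= (L - 1) // 2:
        return 2 * t + 1
    return (L + 1) // 2 + t


def tau_on_ram(L: int, t: int) -> int:
    """Total data transfer of RAM_L in on-line mode, as read from (4.5)."""
    if t < L:
        return t * (t + 1) // 2
    # L*(t - L/2 + 1/2) written with integers; L*(2t - L + 1) is always even.
    return L * (2 * t - L + 1) // 2


if __name__ == "__main__":
    L = 5
    print([lambda_off_ram(L, t) for t in range(6)])
    print([tau_on_ram(L, t) for t in range(8)])
```

Both functions are monotone in t, the first never exceeds L+t, and the two branches of the second agree at t = L, where both give L(L+1)/2.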

Example 5.  For further illustration of the concrete derivation
of these data transfer functions, let us consider both
systems EXAMP1 and EXAMP3 as defined above.

For the system EXAMP1, first we see that $\omega_{\text{EXAMP1}}(t) =
MN+L+t$, for $t \ge 0$.  Let $R_{M,N}$ be the set $\{(j,k,0): 0 \le j < M$ and
$0 \le k < N\}$ of all PE input/output registers of the system.  By
using t operations of the type

    [all PEs]  ADD :0,1,2,3

we obtain the maximal local and total data transfer within the
field of PE accumulators, where

$$\lambda_{\text{EXAMP1}}(R_{M,N},t) = 2t^2 + 2t + 1, \qquad (5.1)$$

(2t2  +  2t+l)MN  -  -  (t+1)2  +  2  )  (M+N)  £  (5.2) 

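$$\tau_{\text{EXAMP1}}(R_{M,N},t) \le (2t^2+2t+1)MN,$$

for $2t+1 \le \min\{M,N\}$, by elementary combinatorial considerations
and (3.2).  For $t \ge t_0 = \lfloor M/2 \rfloor + \lfloor N/2 \rfloor$ we have

$$MN + (t-t_0) \le \lambda_{\text{EXAMP1}}(R_{M,N},t) \le MN+L+t. \qquad (5.3)$$

For $t \ge t_0 = M+N-2$ we can easily see that

$$M^2N^2 + (t-t_0) \le \tau_{\text{EXAMP1}}(R_{M,N},t) \le MN(MN+L+t). \qquad (5.4)$$

Finally, for the case of global data transfer we obtain

$$\gamma_{\text{EXAMP1}}(R_{M,N},t) = \begin{cases} MN & \text{for } t=0, \\ MN + 2t + 1 & \text{for } 2t+1 \le L \text{ and } t>0, \\ MN + \lfloor (L-1)/2 \rfloor + t & \text{for } 2t+1 > L, \end{cases} \qquad (5.5)$$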

where, for $2t+1 \le L$, the maximal global data transfer is possible
by t operations of the type ADD *m and one operation STORE (j,k),
e.g.
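The value $2t^2+2t+1$ in (5.1) is the number of grid positions that can be reached from a fixed PE in at most t moves between direct neighbours of the square net, provided the borders of the array are not touched. A minimal sketch of this count follows; the function name and the breadth-first formulation are assumptions of the sketch, not part of the system definition.

```python
def square_net_ball(t: int) -> int:
    """Count grid cells within t moves on an unbounded 4-neighbour grid."""
    moves = ((1, 0), (-1, 0), (0, 1), (0, -1))
    seen = {(0, 0)}
    frontier = {(0, 0)}
    for _ in range(t):
        # Cells first reached in this step.
        frontier = {(x + dx, y + dy)
                    for (x, y) in frontier
                    for (dx, dy) in moves} - seen
        seen |= frontier
    return len(seen)


if __name__ == "__main__":
    for t in range(6):
        print(t, square_net_ball(t), 2 * t * t + 2 * t + 1)
```

The condition $2t+1 \le \min\{M,N\}$ in the text corresponds to the assumption of an unbounded grid made here.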

For the system EXAMP3, at first we have $\omega_{\text{EXAMP3}}(t) = t \cdot N$, for
$N = 2^{n-1}$ and $t \ge 0$, by using t operations of the type

    [leaf nodes]  READ 0.

Let $R_0 = \{0, (1,0)\}$ be the set of the two distinguished output
registers of this system EXAMP3.  By using the instruction pair

    [leaf nodes]  READ 0,
    [all PEs]     ADD :1,2

repeated (m-1) times, $m \ge 1$; the single instruction

    [leaf nodes]  READ 0

again; and finally (n-1) instructions

    [all PEs]     ADD :1,2,

we obtain the maximal local data transfer for register (1,0)
in any case $t \ge m$.  We have


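$$\lambda_{\text{EXAMP3}}(R_0,t) = \begin{cases} 0 & \text{for } t=0, \\ 2t-1 & \text{for } 1 \le t \le n-1, \\ m \cdot N & \text{for } t = n+2m-\ell,\ m \ge 1 \text{ and } \ell=1 \text{ or } \ell=2, \end{cases}$$

for all $t \ge 0$.  Analogously, for the same set $R_0$ and $t \ge 0$,

$$\gamma_{\text{EXAMP3}}(R_0,t) = \begin{cases} 0 & \text{for } t=0, \\ 2t-1 & \text{for } 1 \le t \le n-1, \\ m \cdot N & \text{for } t = n+2m-2,\ m \ge 1, \\ m \cdot N + 1 & \text{for } t = n+2m-1,\ m \ge 1, \end{cases}$$

$$\tau_{\text{EXAMP3}}(R_0,t) = \begin{cases} 0 & \text{for } t=0, \\ 2t-1 & \text{for } 1 \le t \le n+1, \\ 2m \cdot N & \text{for } t = n+2m-1,\ m \ge 1, \\ 2m \cdot N + 1 & \text{for } t = n+2m,\ m \ge 1. \end{cases} \qquad (6.1)$$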


Of course, the values of $\lambda_{\text{EXAMP3}}$, $\gamma_{\text{EXAMP3}}$, and $\tau_{\text{EXAMP3}}$ depend on
the choice of the set $R_0$, and may be quite different for some
other sets of registers.
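The READ/ADD schedule used above for register (1,0) can be traced by a small simulation of receptive fields. The sketch below rests on two assumptions about Table 3, which is not reproduced in this section: a parallel READ replaces the receptive field of each leaf accumulator by one fresh name, and a parallel ADD :1,2 unites the field of each PE accumulator with the fields of its two son accumulators. PE j is taken to have the sons 2j and 2j+1, and all names in the code are hypothetical.

```python
def top_node_field(n: int, m: int):
    """Simulate (READ, ADD)^(m-1), READ, ADD^(n-1) on a bintree of n levels.

    Returns (cardinality of the receptive field of register (1,0), steps).
    """
    first_leaf, last = 2 ** (n - 1), 2 ** n - 1
    rec = {j: set() for j in range(1, last + 1)}
    state = {"wave": 0, "steps": 0}

    def read():
        # Assumed rule: each leaf accumulator gets one new input name.
        state["wave"] += 1
        for j in range(first_leaf, last + 1):
            rec[j] = {(j, state["wave"])}
        state["steps"] += 1

    def add():
        # Assumed rule: union with both sons, evaluated in parallel.
        old = {j: set(s) for j, s in rec.items()}
        for j in range(1, first_leaf):
            rec[j] = old[j] | old[2 * j] | old[2 * j + 1]
        state["steps"] += 1

    for _ in range(m - 1):
        read()
        add()
    read()
    for _ in range(n - 1):
        add()
    return len(rec[1]), state["steps"]


if __name__ == "__main__":
    print(top_node_field(3, 2))   # field size and number of steps
```

Under these assumptions the top node collects all m waves of N = 2^(n-1) names after n+2m-2 steps, which is the case ℓ=2 of the value m·N given for the local data transfer.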

Definition 6.  Let CLASS $\subseteq$ SIMD.  The general data transfer
functions are defined as follows, for such a set CLASS of models
of computation, for $t,n \ge 0$:

$\Lambda_{CLASS}(t)$ denotes the maximal value of all $\lambda_{SYS}(R,t)$,
$\Gamma_{CLASS}(n,t)$ denotes the maximal value of all $\gamma_{SYS}(R,t)$
with card(R)=n, and
$T_{CLASS}(n,t)$ denotes the maximal value of all $\tau_{SYS}(R,t)$
with card(R)=n, where SYS is an arbitrary element of
CLASS, and R denotes a set of registers of SYS.

Interesting examples of CLASS are sets like OFF-NET$_p$, ON-NET$_{p,q}$,
OFF-SQUARE, OFF-BINTREE, or ON-HEXAGONAL, where these general
data transfer functions are fully defined.


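Theorem 1.  For standard off-line network systems and
$2 \le p < \infty$, we have

$$\Lambda_{\text{OFF-NET}_p}(t) = \begin{cases} 2t+1 & \text{for } p=2, \\ p\,\dfrac{(p-1)^t - 1}{p-2} + 1 & \text{for } p \ge 3, \end{cases}$$

$$\Gamma_{\text{OFF-NET}_p}(n,t) = T_{\text{OFF-NET}_p}(n,t) = n \cdot \Lambda_{\text{OFF-NET}_p}(t),$$

for $n,t \ge 0$.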

Proof.  First, let us consider the local situation.  For
p=2, the maximal transfer of data units is possible by indirect
addressing to the CPU accumulator, e.g.  For $p \ge 3$, there exist
special OFF-NET$_p$ models SYS$_t$ such that, according to (OFF.3),
at any moment $1 \le s \le t$ the maximal possible number of $p(p-1)^{s-1}$
new names of input registers may enter the receptive field of
a certain register r, for t>0.  Thus,

$$\lambda_{SYS_t}(\{r\},t) = 1 + \sum_{s=0}^{t-1} p(p-1)^s = p\,\frac{(p-1)^t - 1}{p-2} + 1.$$

For the total and global situation note that by choosing
sufficiently complex SYS$_{n,t}$, for n,t>0, the maximal local
situations of data transfer characterized by receptive fields
of cardinality $\Lambda_{\text{OFF-NET}_p}(t)$ at moment t may appear in n
different registers at time t such that these registers are far
enough from one another so that their receptive fields are
pairwise disjoint.  □
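The closed form used in the proof is a geometric sum and may be checked directly. The following sketch compares the sum with the closed form; the function names are chosen for illustration only.

```python
def off_net_field_sum(p: int, t: int) -> int:
    """1 + sum over s = 0..t-1 of p*(p-1)^s, the count used in the proof."""
    return 1 + sum(p * (p - 1) ** s for s in range(t))


def off_net_field_closed(p: int, t: int) -> int:
    """Closed form p*((p-1)^t - 1)/(p-2) + 1, defined for p >= 3 only."""
    if p < 3:
        raise ValueError("closed form requires p >= 3")
    # (p-1)^t - 1 is divisible by p-2, so integer division is exact.
    return p * ((p - 1) ** t - 1) // (p - 2) + 1


if __name__ == "__main__":
    for p in (3, 5):
        print(p, [off_net_field_closed(p, t) for t in range(5)])
```

For p=2 the sum degenerates to 1+2t, which agrees with the separate case 2t+1 of the theorem.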

Example 6.  By (4.1) and Theorem 1, it follows that
$\Lambda_{\text{OFF-RAM}}(t) = \Lambda_{\text{OFF-NET}_2}(t) = 2t+1$, for $t \ge 0$.  Of course, this
coincidence is not true in the total and global cases.  According
to Theorem 1 we have $\Gamma_{\text{OFF-NET}_2}(n,t) = T_{\text{OFF-NET}_2}(n,t) = n(2t+1)$,
for $n,t \ge 0$, but by elementary considerations $\Gamma_{\text{OFF-RAM}}(n,t) = 2t+n$,
for $n \ge 1$ and $t \ge 0$, and $T_{\text{OFF-RAM}}(n,t) = 2n(t-n+2)-2$, for $t \ge n \ge 2$.

In Table 4 the general local data transfer functions are
collected for some classes of off-line systems as defined in
Section 1.  For these classes, the functions $\Lambda_{\text{OFF-NET}_p}$ as given
in Theorem 1 act as upper bounds, where the proper value of p
has to be specified.  The classes OFF-LINEAR, OFF-PS,
OFF-BINTREE and OFF-QUADTREE represent examples for the maximal
transfer situations as characterized by Theorem 1, for p=2,3,5,
respectively.

Some remarks about Table 4 and about the other networks
which were defined in Table 1.


1.  For  the  bintree,  triangle  and  quadtree  network  note 
that  the  maximal  receptive  fields  may  be  obtained  for  central 
nodes  of  these  tree  structures  only,  and  not  at  the  top  node. 

The  maximal  possible  cardinalities  of  receptive  fields  of  top 
node  accumulators  are  given  for  illustration  of  this  fact. 

2.  For all examples of CLASS given in Table 4, we have
$\Gamma_{\text{OFF-CLASS}}(n,t) = T_{\text{OFF-CLASS}}(n,t) = n \cdot \Lambda_{\text{OFF-CLASS}}(t)$, for $n,t \ge 0$.

3.  The hexagonal, square, triagonal, and diagonal networks
are special examples of infinite graphs of constant degree p
such that the general local data transfer function is equal to
$\frac{p}{2}t^2 + \frac{p}{2}t + 1$.  Such networks correspond to usual digital
metrics for the orthogonal grid in a natural way, e.g., the metrics
$d_4$ or $d_8$ as used in digital image processing, cp. [9], to the
square or diagonal network, respectively.

4.  For the networks CUBE$^m$, PM2I$^m$, WPM2I$^m$, LR2I$^m$, or LRUD2I$^m$,
the derivation of the three general data transfer functions
represents a very sophisticated problem.  Of course, the values of
these functions depend on the value of m, and the consideration
of classes like

$$\text{CUBE} = \bigcup_{m \ge 2} \text{CUBE}^m$$

would lead to undefined general data transfer functions.  In [4]
the general local data transfer functions were analyzed for some
concrete SIMD systems similar to FIN-OFF-LR2I$^m$ or FIN-OFF-LRUD2I$^m$
systems like EXAMP2 which was defined above.  But, for the present
paper, we recommend data transfer analysis for specialized (finite)
SIMD systems to the interested reader, and are satisfied with some
hints:


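CUBE$^m$:  For this system, the exact derivation of the local
transfer function should be a solvable task.  We have

$$\Lambda_{\text{OFF-CUBE}^m}(t) \begin{cases} = \sum_{i=0}^{t} \binom{m}{i} & \text{for } t<m, \\ \ge 2^m & \text{for } t=m, \\ > 2^{m+1}(t-m) & \text{for } t>m. \end{cases}$$

For example, we have $\Lambda_{\text{OFF-CUBE}^{256}}(4) = 177{,}589{,}057$ and
$\Lambda_{\text{OFF-CUBE}^{256}}(8)$ is about $4 \cdot 10^{14}$.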

PM2I$^m$:  For this, as for the other "power-of-two systems,"
the analysis of data flow represents quite a hard problem, cp.
[4].  But, to give the reader some feeling about the complexity
of the data transfer functions for these systems, some values
will be collected:

$$\Lambda_{\text{OFF-PM2I}^m}(t) \begin{cases} = 1 & \text{for } t=0, \\ = 2m & \text{for } t=1, \\ = 2(m-1)(m-2)+4 & \text{for } t=2, \\ > 2^m & \text{for } t=\lceil m/2 \rceil, \\ > 2^{m+1}(t-\lceil m/2 \rceil) & \text{for } t>\lceil m/2 \rceil. \end{cases}$$

Note that exponential increase changes to linear increase at
$t = \lceil m/2 \rceil$.


WPM2I$^m$:  It may be that this is the most complicated situation
of any network; we have


« 


A0FF-WPM2lm(t) 


>  2m+1 (t- fm/21 ) 


for  t=0 
for  t=l 

for  t=rm/21 
for  t?  fm/21 


This  great  difficulty  in  analyzing  data  paths  should  be  a  hint 
to  the  limited  practical  importance  of  this  network. 

LR2I$^m$:  For brevity we shall use the function

$$a(i) = 2\sum_{j=1}^{i} j^2 = \tfrac{1}{3}(i+1) - (i+1)^2 + \tfrac{2}{3}(i+1)^3.$$

We found the following interesting values:

$$\Lambda_{\text{OFF-LR2I}^m}(t) = \begin{cases}
1 & \text{for } t=0, \\
2m+1 & \text{for } t=1, \\
2(m-2)^2+4m+1 & \text{for } t=2, \\
1+6m+4(m-2)^2+2 \cdot a(m-4) & \text{for } t=3, \\
1+8m+6(m-2)^2+4 \cdot a(m-4) + 4\sum_{i=1}^{m-6} a(i) & \text{for } t=4, \\
1+10m+8(m-2)^2+6 \cdot a(m-4) + 8\sum_{i=1}^{m-6} a(i) + 8\sum_{i=1}^{m-8}\sum_{j=1}^{i} a(j) & \text{for } t=5, \\
2^m \cdot t - c_m & \text{for } t \ge \lceil (m-1)/2 \rceil.
\end{cases}$$

The constants $c_m$ depend on the value of m only, for example
$c_2=-1$, $c_3=1$, $c_4=7$, $c_5=25$, $c_6=71$, $c_7=185$, $c_8=455$, $c_9=1081$, and
$c_{10}=2503$.  Because the LR2I$^m$ is an infinite network,
$\Gamma_{\text{OFF-LR2I}^m}(n,t) = T_{\text{OFF-LR2I}^m}(n,t) = n \cdot \Lambda_{\text{OFF-LR2I}^m}(t)$, for $n,t \ge 0$.

LRUD2Im:  Of  course,  we  have 

AOFF-LRUD2Iin^  ~2  '  A0FF-LR2lm^ '  for  t“0,  and,  because 
LRUD2I$^m$ is an infinite network we have
$\Gamma_{\text{OFF-LRUD2I}^m}(n,t) = T_{\text{OFF-LRUD2I}^m}(n,t) = n \cdot \Lambda_{\text{OFF-LRUD2I}^m}(t)$, for $n,t \ge 0$.
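Some of the values quoted in these hints can be reproduced by direct counting. The sketch below assumes that CUBE$^m$ is the m-dimensional binary cube, so that the PEs within t steps of a given PE are counted by a sum of binomial coefficients, and that LR2I$^m$ connects PE j on the infinite line with the PEs $j \pm 2^i$ for i=0,1,...,m-1; both readings of Table 1, and all function names, are assumptions of this sketch.

```python
from math import comb


def cube_ball(m: int, t: int) -> int:
    """Number of nodes of the m-cube within Hamming distance t of a node."""
    return sum(comb(m, i) for i in range(min(t, m) + 1))


def lr2i_ball(m: int, t: int) -> int:
    """Number of integers reachable from 0 in at most t steps of +-2^i, i < m."""
    steps = [sign * 2 ** i for i in range(m) for sign in (1, -1)]
    seen = {0}
    frontier = {0}
    for _ in range(t):
        frontier = {x + d for x in frontier for d in steps} - seen
        seen = seen | frontier
    return len(seen)


if __name__ == "__main__":
    print(cube_ball(256, 4))
    # Compare with 2^m * t - c_m for the constants c_m listed in the text.
    for m, t, c in ((3, 2, 1), (4, 2, 7), (5, 3, 25)):
        print(m, t, lr2i_ball(m, t), 2 ** m * t - c)
```

The first count reproduces the value 177,589,057 given for m=256 and t=4, and the small LR2I cases agree with $2^m t - c_m$ for $c_3=1$, $c_4=7$ and $c_5=25$.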

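Theorem 2.  For standard on-line network systems and $2 \le p < \infty$,
$1 \le q \le p-1$,

$$\Lambda_{\text{ON-NET}_{p,q}}(t) = \begin{cases} 0 & \text{for } t=0, \\ 2t-1 & \text{for } t \ge 1 \text{ and } q=1, \\ (q^t-1)/(q-1) & \text{for } t \ge 1 \text{ and } q \ge 2, \end{cases}$$

and $\Gamma_{\text{ON-NET}_{p,q}}(n,t) = T_{\text{ON-NET}_{p,q}}(n,t) = n \cdot \Lambda_{\text{ON-NET}_{p,q}}(t)$, for $n,t \ge 0$.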


Proof.  Consider the local data transfer situation first.  At
t=1 assume that a sufficiently large set of input registers
obtain input data in parallel by a READ instruction.  Then, for
t=1, both $(q^t-1)/(q-1)$ (for $q \ge 2$) and $2t-1$ (for q=1) are equal
to 1.  For q=1, the maximal local transfer situation, i.e., the
maximal transfer of data units to a given register, is possible
by indirect addressing.  Thus, $\Lambda_{\text{ON-NET}_{p,1}}(t) = 2t-1$ for $t \ge 1$.
For $q \ge 2$, according to (ON.3) it follows that

$$\Lambda_{\text{ON-NET}_{p,q}}(t) = \sum_{i=1}^{t} q^{i-1} = (q^t-1)/(q-1),$$

where these maximal cardinalities of receptive fields may be
obtained in certain PE accumulators.  For given $n,t \ge 0$, by choosing
a sufficiently large field of PEs obtaining input data in their
accumulators at the first instruction (i=1), n receptive fields
of maximal cardinality $\Lambda_{\text{ON-NET}_{p,q}}(t)$ may be pairwise disjoint.  □
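The growth $(q^t-1)/(q-1)$ for $q \ge 2$ can be reproduced on a complete q-ary tree. In the sketch below every PE reads one input value at the first step, and at each later step every PE unites its receptive field with those of its q sons; this tree, the update rule and the names used are assumptions chosen to mirror the proof, not a definition taken from Table 1 or Table 3.

```python
def on_net_field(q: int, t: int) -> int:
    """Receptive field size at the root of a q-ary tree after t steps (q >= 2)."""
    if q < 2:
        raise ValueError("sketch covers q >= 2 only")
    if t == 0:
        return 0
    total = sum(q ** level for level in range(t))   # nodes in t levels
    rec = {j: {j} for j in range(total)}            # step 1: parallel READ
    for _ in range(t - 1):                          # steps 2, ..., t
        old = rec
        rec = {}
        for j in range(total):
            # Sons of node j in heap order; nodes below level t-1 are cut off.
            sons = [c for c in range(q * j + 1, q * j + q + 1) if c < total]
            rec[j] = set(old[j]).union(*(old[c] for c in sons))
    return len(rec[0])


if __name__ == "__main__":
    for q in (2, 3, 4):
        print(q, [on_net_field(q, t) for t in range(5)])
```

The case q=1 is excluded because, according to the proof, its value 2t-1 is obtained by indirect addressing rather than by network transfer.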


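Example 7.  By (4.4) we know that $\Lambda_{\text{ON-RAM}}(t) = \Gamma_{\text{ON-RAM}}(n,t) = t$,
for $t \ge 0$ and $n \ge 1$, and thus $\Lambda_{\text{ON-RAM}}(t) < \Lambda_{\text{ON-NET}_{p,1}}(t)$ as well as
$\Gamma_{\text{ON-RAM}}(n,t) < \Gamma_{\text{ON-NET}_{p,1}}(n,t)$ for $t \ge 2$ and $n \ge 1$.  Furthermore,
$T_{\text{ON-RAM}}(n,t) = n(t - \frac{n}{2} + \frac{1}{2})$, for $t \ge n \ge 1$, and thus
$T_{\text{ON-RAM}}(n,t) < T_{\text{ON-NET}_{p,1}}(n,t)$ for $t \ge n \ge 2$.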

In Table 5 for classes of on-line systems mentioned in
Section 1 some results on the analysis of general local data
transfer functions are collected.  For these classes the functions
given in Theorem 2 act as upper bounds where the proper values of
p and q have to be correlated.  By ON-IN$_{\{i_1,i_2,\ldots,i_q\}}$ we denote a
special ON-IN system with fixed set $\{i_1,i_2,\ldots,i_q\}$ according to
(ON.2).  The classes ON-LINEAR$_{\{0\}}$, ON-BINTREE$_{\{1,2\}}$ and
ON-QUADTREE$_{\{1,2,3,4\}}$ represent examples for maximal transfer
situations as characterized by Theorem 2.

Some remarks about Table 5 and about the other networks
which were defined in Table 1:

1.  For all examples of CLASS in Table 5 we have
$\Gamma_{\text{ON-CLASS}}(n,t) = T_{\text{ON-CLASS}}(n,t) = n \cdot \Lambda_{\text{ON-CLASS}}(t)$.

2.  The class ON-PS$_{\{0,1\}}$ denotes special SIMD systems using
the PS network in its original [10] meaning.  Let $f_0=1$, $f_1=1$, $f_2=2$,
..., $f_{n+2}=f_n+f_{n+1}$, where

$$f_n = \big[(1+\sqrt{5})^{n+1} - (1-\sqrt{5})^{n+1}\big] / (\sqrt{5} \cdot 2^{n+1})$$

denotes the nth Fibonacci number, $n \ge 0$.  We have

$$\Lambda_{\text{ON-PS}_{\{0,1\}}}(t) = \sum_{n=1}^{t} f_n = f_{t+2} - 2,$$

for $t \ge 0$; cp. [3] for a similar result.

3.  For the bintree, triangle, and quadtree network note that
the maximal receptive fields may be obtained for the top node
accumulator, for $\{i_1,i_2,\ldots,i_q\}$ equal to {1,2}, {1,2,3,4},
{1,2,3,4}, respectively.
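The Fibonacci identity used in remark 2 is easy to confirm numerically. The sketch below assumes the indexing $f_0 = f_1 = 1$, which is the indexing produced by the closed form with exponents n+1; the function names are illustrative.

```python
from math import sqrt


def fib(n: int) -> int:
    """Fibonacci numbers with f_0 = f_1 = 1 and f_{n+2} = f_n + f_{n+1}."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a


def fib_closed(n: int) -> float:
    """Closed form [(1+sqrt5)^(n+1) - (1-sqrt5)^(n+1)] / (sqrt5 * 2^(n+1))."""
    r = sqrt(5.0)
    return ((1 + r) ** (n + 1) - (1 - r) ** (n + 1)) / (r * 2 ** (n + 1))


def ps_field(t: int) -> int:
    """Sum of f_1, ..., f_t, the receptive field size stated in remark 2."""
    return sum(fib(n) for n in range(1, t + 1))


if __name__ == "__main__":
    for t in range(8):
        print(t, ps_field(t), fib(t + 2) - 2, round(fib_closed(t)))
```

The partial sums 0, 1, 3, 6, 11, ... coincide with $f_{t+2}-2$, so the receptive field grows like a Fibonacci number rather than like a power of two.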


3.  Local, global, and total data dependence measures


For parallel processing systems, the optimal time for the
solution of a computational problem depends upon the data
transfer abilities of the given system as well as on the
principal possibilities of parallelization of a solution process
for a given problem.  The first may be characterized by the data
transfer functions $\lambda_{SYS}$, $\gamma_{SYS}$, $\tau_{SYS}$ by a general system analysis
as considered in Section 2.  The second property, however,
requires individual consideration of the given computational
problem.

For example, consider the multiplication of two N×N real
matrices $A \cdot B = C$.  For a given system SYS assume that all $N^2$
elements of matrix C have to be computed in $N^2$ different output
registers represented by the set $R_{OUT}$.  Let $r \in R_{OUT}$, $R_0 \subseteq R_{OUT}$,
and $R_1$ be the set of N distinctive registers for outputting the
N diagonal elements of C.  Then it follows that $\lambda_{SYS}(\{r\},t^*) \ge 2N$,
$\gamma_{SYS}(R_1,t^*) \ge 2N^2$ and $\tau_{SYS}(R_0,t^*) \ge 2N \cdot \mathrm{card}(R_0)$ if the product $A \cdot B$
is to be computed on SYS within time $t^*$.  Thus, if the functions
$\lambda_{SYS}$, $\gamma_{SYS}$ or $\tau_{SYS}$ are known, lower time bounds are derivable from
these inequalities for the solution time $t^*$ immediately, where
the maximal lower time bound from the three possible values is
taken as the result.  For example, according to our considerations
in Section 2 for the system EXAMP1 we have $t^* \ge \sqrt{N} - 1$ under the
assumption that M=2N.  But note that a better lower time bound for
this system and the matrix multiplication problem may be obtained
by more specialized considerations as demonstrated by GENTLEMAN
[3, Theorem 1].  Because each data unit transfer from a certain
register $r_1$ to a certain register $r_2$ of the system EXAMP1 may
be performed in the reverse direction, from $r_2$ to $r_1$, in the
same time, the proof of Theorem 1 in [3] matches the situation
given by the system EXAMP1, i.e., for $r \in R_{OUT}$ we have
$\lambda_{\text{EXAMP1}}(\{r\},2t^*) \ge N^2$, and thus $t^* \ge \frac{1}{4}(2N^2-1)^{1/2} - \frac{1}{4}$.
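Both bounds for EXAMP1 follow from inverting the local data transfer function (5.1). As a sketch under the assumption that $\lambda_{\text{EXAMP1}}(\{r\},t) = 2t^2+2t+1$ holds for the relevant values of t, that is, away from the borders of the array, the smallest admissible time can be found by search and compared with the closed-form bounds of the text. All function names are illustrative.

```python
from math import sqrt


def examp1_local(t: int) -> int:
    """Local data transfer 2t^2 + 2t + 1 of the square net, cf. (5.1)."""
    return 2 * t * t + 2 * t + 1


def min_time(required: int, capacity) -> int:
    """Smallest t such that capacity(t) >= required."""
    t = 0
    while capacity(t) < required:
        t += 1
    return t


def matmul_bounds(N: int):
    """Return (t from lambda(t) >= 2N, t from lambda(2t) >= N^2)."""
    simple = min_time(2 * N, examp1_local)
    refined = min_time(N * N, lambda t: examp1_local(2 * t))
    return simple, refined


def closed_forms(N: int):
    """The two closed-form lower bounds stated in the text."""
    return sqrt(N) - 1, (sqrt(2 * N * N - 1) - 1) / 4


if __name__ == "__main__":
    for N in (8, 64, 512):
        print(N, matmul_bounds(N), closed_forms(N))
```

Solving $8t^2+4t+1 \ge N^2$ for t gives exactly $t \ge \frac{1}{4}(2N^2-1)^{1/2} - \frac{1}{4}$, which grows linearly in N, whereas the first bound grows only like $\sqrt{N}$.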

For  a  general  approach  to  the  derivation  of  lower  time  bounds 
for  parallel  processing  systems  we  shall  use  the  quantitative 
description  of  data  dependencies  of  the  desired  output  data  in 
relation  to  the  input  data  specification,  for  computational 
problems  which  may  be  identified  with  special  functions  as  de¬ 
scribed  later  on. 

Definition 7.  Let $n,m \ge 1$.  Let f be an n-ary function defined
on a certain set domain(f) of n-tuples of real numbers, and into
the set of m-tuples of real numbers.  For an n-tuple
$(x_1,x_2,\ldots,x_n) \in$ domain(f), define

$$sub_i(x_1,x_2,\ldots,x_n) = \{\, j : 1 \le j \le n \ \&\ (\exists x_j' \ne x_j)\,((x_1,\ldots,x_{j-1},x_j',x_{j+1},\ldots,x_n) \in \text{domain}(f) \ \&\ proj_i(f(x_1,x_2,\ldots,x_n)) \ne proj_i(f(x_1,\ldots,x_{j-1},x_j',x_{j+1},\ldots,x_n))) \,\}$$

to be the set of all positions j such that changes in the jth
component of $(x_1,x_2,\ldots,x_n)$ have an effect on the projection
$proj_i f$, for $1 \le i \le m$.  Then, define

$$\lambda_f = \max_{(x_1,x_2,\ldots,x_n)} \ \max_{1 \le i \le m} \mathrm{card}(sub_i(x_1,x_2,\ldots,x_n)),$$

$$\gamma_f = \max_{(x_1,x_2,\ldots,x_n)} \mathrm{card}\Big(\bigcup_{i=1}^{m} sub_i(x_1,x_2,\ldots,x_n)\Big), \quad \text{and}$$

$$\tau_f = \max_{(x_1,x_2,\ldots,x_n)} \sum_{i=1}^{m} \mathrm{card}(sub_i(x_1,x_2,\ldots,x_n)).$$

The function f is called locally d-dependent iff $d \le \lambda_f$, globally
d-dependent iff $d \le \gamma_f$, and totally d-dependent iff $d \le \tau_f$, for an
integer $d \ge 0$.

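By this definition, for arbitrary functions f defined on
n-tuples of real numbers and into the set of m-tuples of real
numbers, it follows immediately that $\lambda_f = \gamma_f = \tau_f$ if m=1, and,
for arbitrary $m \ge 1$,

$$\lambda_f \le \gamma_f \le \tau_f, \qquad (7.1)$$

$$\gamma_f \le n, \quad \text{and} \qquad (7.2)$$

$$\tau_f \le m \cdot \lambda_f. \qquad (7.3)$$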

For example, in the case of the following function f,

$$f(x_1,x_2,x_3,x_4,x_5) = \begin{cases} x_1 + x_2 & \text{if } x_5 = 0, \\ x_3 + x_4 & \text{if } x_5 \ne 0, \end{cases}$$

we have $sub_1(x_1,x_2,x_3,x_4,0) = \{1,2,5\}$ if $x_1+x_2 \ne x_3+x_4$, and
$sub_1(x_1,x_2,x_3,x_4,0) = \{1,2\}$ if $x_1+x_2 = x_3+x_4$.  Because of
$\lambda_f = \gamma_f = \tau_f = 3$, this function is locally, globally, and totally
1-, 2-, and 3-dependent, but not 4- or 5-dependent.
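The sets $sub_i$ of this example can be explored by brute force. The sketch below tests, for each position j, whether some replacement value taken from a small candidate set changes the ith output component. Since only finitely many replacements are tried, it can in general only find witnesses for dependence and may miss some; for the function considered here the small candidate set suffices. All names are illustrative.

```python
def f(x):
    """The example function: x1+x2 if x5 == 0, else x3+x4 (one output)."""
    x1, x2, x3, x4, x5 = x
    return (x1 + x2 if x5 == 0 else x3 + x4,)


def sub(func, x, i, candidates):
    """Positions j (1-based) where some candidate value changes output i."""
    base = func(x)[i]
    positions = set()
    for j in range(len(x)):
        for v in candidates:
            if v == x[j]:
                continue
            y = x[:j] + (v,) + x[j + 1:]
            if func(y)[i] != base:
                positions.add(j + 1)
                break
    return positions


if __name__ == "__main__":
    values = range(-3, 4)
    print(sub(f, (1, 2, 3, 4, 0), 0, values))   # x1+x2 != x3+x4
    print(sub(f, (1, 2, 2, 1, 0), 0, values))   # x1+x2 == x3+x4
```

In the second call, changing $x_5$ to a non-zero value switches the branch but leaves the result unchanged, so position 5 drops out of the set, as stated in the text.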

Now,  in  a  sequence  of  examples,  the  data  dependence  measures 
as  given  by  Definition  7  will  be  analyzed  for  certain  computational 
problems.  The results are collected in Table 6, i.e., the following
examples may be considered as explanatory remarks to this
table.

Example 8.  The multiplication of two N×N real matrices may
be considered as a $2N^2$-ary function into the set of $N^2$-tuples of
real numbers.  For this computational problem, it is evident that

$$\lambda_{\text{MATRIX-MULTIPLICATION}} = 2N,$$
$$\gamma_{\text{MATRIX-MULTIPLICATION}} = 2N^2, \quad \text{and}$$
$$\tau_{\text{MATRIX-MULTIPLICATION}} = 2N^3,$$

where these maximal values of data dependence are true for each
input vector of length $2N^2$ containing non-zero values in all
positions.  By this example it follows that the upper bounds
(7.2) and (7.3) cannot be reduced in general.  The inversion
of an N×N real matrix in place may be considered as an $N^2$-ary
function into the set of $N^2$-tuples of real numbers.  We have

$$\lambda_{\text{MATRIX-INVERSION-IP}} = \gamma_{\text{MATRIX-INVERSION-IP}} = N^2,$$
$$\tau_{\text{MATRIX-INVERSION-IP}} = N^4,$$

where this maximal case of data dependence appears for any
matrix containing non-zero values in all $N^2$ positions.  These
data dependence quantities may be considered as a direct
consequence of the data dependence quantities for the determinant
of an N×N real matrix,

$$\lambda_{\text{DETERMINANT}} = \gamma_{\text{DETERMINANT}} = \tau_{\text{DETERMINANT}} = N^2.$$

The solution of a system of N linear equations in N unknowns
may be considered as an $(N^2+N)$-ary function into the set of
N-tuples of real numbers.  We obtain

$$\lambda_{\text{LINEAR-EQUATIONS}} = \gamma_{\text{LINEAR-EQUATIONS}} = N^2+N, \quad \text{and}$$
$$\tau_{\text{LINEAR-EQUATIONS}} = N^3+N^2.$$

Transposing an N×N real matrix in place may be considered as
an $N^2$-ary function into the set of $N^2$-tuples of real numbers,

$$\lambda_{\text{TRANSPOSITION-IP}} = 1, \quad \text{and}$$
$$\gamma_{\text{TRANSPOSITION-IP}} = \tau_{\text{TRANSPOSITION-IP}} = N^2,$$

but for binary operations on permutated N×N real matrices in
place, considered as $N^2$-ary functions into the set of $N^2$-tuples
of real numbers,

$$\lambda_{\text{MATRIX-}\pi\text{-IP}} = 2 \quad \text{for } \pi \ne id,$$
$$\gamma_{\text{MATRIX-}\pi\text{-IP}} = N^2, \quad \text{and}$$
$$\tau_{\text{MATRIX-}\pi\text{-IP}} = 2N^2 - \mathrm{card}\{(i,j) : 0 \le i,j \le N-1 \ \&\ \pi(i,j) = (i,j)\};$$

the transposition may be considered as a special permutation $\pi^*$,
$\tau_{\text{MATRIX-}\pi^*\text{-IP}} = 2N^2 - N$, and $op_2$ as exchange operation in
this case, $op_2(a_{ij}, a_{\pi^*(i,j)}) = (a_{\pi^*(i,j)}, a_{ij})$, where the second
component of these resulting tuples will be considered as a dummy
result.
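For matrix multiplication the three measures can be recomputed from the structure of the product alone. The sketch below assumes, as in the text, that all input values are non-zero, so that the element $c_{ik}$ depends on the whole ith row of A and the whole kth column of B; the encoding of input positions as tuples is hypothetical.

```python
def matmul_measures(N: int):
    """Return (lambda, gamma, tau) for the product of two full N x N matrices."""
    subs = {}
    for i in range(N):
        for k in range(N):
            row = {("A", i, j) for j in range(N)}   # inputs a_i0 .. a_i,N-1
            col = {("B", j, k) for j in range(N)}   # inputs b_0k .. b_N-1,k
            subs[(i, k)] = row | col
    lam = max(len(s) for s in subs.values())
    gam = len(set().union(*subs.values()))
    tau = sum(len(s) for s in subs.values())
    return lam, gam, tau


if __name__ == "__main__":
    for N in (1, 2, 3, 4):
        print(N, matmul_measures(N), (2 * N, 2 * N ** 2, 2 * N ** 3))
```

Here $\gamma_f$ equals the number $n = 2N^2$ of inputs and $\tau_f$ equals $m \cdot \lambda_f = N^2 \cdot 2N$, which is why this example shows that the bounds (7.2) and (7.3) are attained.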


Example 9. In this example, three two-dimensional transforms of $N \times N$ pictures will be dealt with. First, the Fourier transform of an $N \times N$ complex matrix (2D-DFT, two-dimensional discrete Fourier transform, cp. [9]) may be considered as a $2N^2$-ary function into the set of $2N^2$-tuples of real numbers. In this case, we have

$$2N^2-4 \le \lambda_{\text{2D-DFT}} \le 2N^2-1,$$

$$\gamma_{\text{2D-DFT}} = 2N^2, \quad\text{and}$$

$$2N^4 \le \tau_{\text{2D-DFT}} \le 4N^4-2N^2,$$

where these maximal values of data dependence are true for each input vector of length $2N^2$ containing non-zero values in all positions. For the exact determination of $\lambda_{\text{2D-DFT}}$ and $\tau_{\text{2D-DFT}}$, the influence of different values of N has to be studied. The Walsh transform of an $N \times N$ real matrix (2D-WT, two-dimensional Walsh transform, cp. [9]) may be considered as an $N^2$-ary function into the set of $N^2$-tuples of real numbers,

$$\lambda_{\text{2D-WT}} = \gamma_{\text{2D-WT}} = N^2, \quad\text{and}\quad \tau_{\text{2D-WT}} = N^4,$$

where these maximal values of data dependence are true for any input vector of length $N^2$. The computation of the parallel Roberts gradient (see Example 1) on images of size $M \times N$ may be considered as an MN-ary function into the set of MN-tuples of real numbers. For this function,

$$\lambda_{\text{ROBERTS\_GRADIENT}} = 4,$$

$$\gamma_{\text{ROBERTS\_GRADIENT}} = MN, \quad\text{and}$$

$$\tau_{\text{ROBERTS\_GRADIENT}} = 4MN-2M-2N-2,$$

by considering the case of non-zero values in all MN positions, and by paying attention to border effects.
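The Walsh transform values can be illustrated by counting non-zero coefficients. The sketch below assumes a Sylvester-ordered Hadamard matrix H with N a power of two, writes the transform as $Y = HXH$, and treats an output as dependent on an input exactly when the corresponding coefficient is non-zero; these modelling choices and the function names are assumptions for illustration.

```python
def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def walsh_dependence(n):
    """Return (local, total) dependence counts for Y = H X H."""
    h = hadamard(n)
    # y[i][j] = sum over k, l of h[i][k] * x[k][l] * h[l][j]
    per_output = [
        sum(1 for k in range(n) for l in range(n) if h[i][k] * h[l][j] != 0)
        for i in range(n)
        for j in range(n)
    ]
    return max(per_output), sum(per_output)
```

Since every entry of H is +1 or -1, each of the $N^2$ outputs depends on all $N^2$ inputs, which gives the local value $N^2$ and the total value $N^4$.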

Example  10.  The  computation  of  the  convex  hull  of  a  simple 
polygon,  cp.  [5] ,  where  the  N  extreme  points  of  the  polygon  are 
given  by  coordinate  tuples  of  real  numbers  starting  with  the 
uppermost-leftmost  point,  may  be  considered  as  a  2N-ary  function 
into  the  set  of  2N-tuples  of  real  numbers.  In  the  resulting 
vector  of  length  2N,  there  appear  all  coordinate  tuples  of  the 
extreme  points  of  the  convex  hull  of  the  given  polygon  in  order, 
starting  with  the  uppermost-leftmost  point,  and  with  the  same  run 
orientation  as  the  given  polygon.  Positions  actually  not  needed 
in  this  resulting  2N-tuple  contain  value  zero  by  assumption.  In 
this  case,  it  follows  that 


$$\lambda_{\text{CH\_SIPOL}} = \gamma_{\text{CH\_SIPOL}} = 2N, \quad\text{and}$$

$$2N^2-8N+12 \le \tau_{\text{CH\_SIPOL}} \le 4N^2,$$


by  analyzing  the  input  situation  of  special  convex  polygons 
with  N  extreme  points  as  illustrated  in  Fig.  2,  for  N>4.  The 
computation  of  the  convex  hull  of  N  planar  points,  cp.  [5]  , 
given  by  coordinate  tuples  of  real  numbers,  may  be  considered 
as  a  2N-ary  function  into  the  set  of  2N-tuples  of  real  numbers 
as  described  above,  analogously  to  the  simple  polygon  situation. 
For this problem,

$$\lambda_{\text{CH\_POINT}} = \gamma_{\text{CH\_POINT}} = 2N, \quad\text{and}\quad \tau_{\text{CH\_POINT}} = 4N^2,$$

where these maximal values are true for any input situation. The computation of the Voronoi diagram of N planar points, cp. [5], given by coordinate tuples of real numbers, may be considered as a 2N-ary function into the set of (18N-33)-tuples of real numbers in the following sense. The Voronoi diagram may have 2N-5 vertices at most, and, as a special planar graph, 3N-6 edges at most, for $N \ge 3$. See Fig. 3 for an illustration of the construction of such a "maximal Voronoi diagram," where the number v(N) of vertices, and the number e(N) of edges satisfy the recursive equations

$$v(3) = 1, \quad e(3) = 3,$$

$$v(N+1) = v(N)+2, \quad\text{and}\quad e(N+1) = e(N)+3$$

for $N \ge 3$. The 18N-33=3(2N-5)+4(3N-6) positions of the resulting vector of a Voronoi diagram computation we consider as a unique characterization of a Voronoi diagram by linearization of adjacency lists for this special graph structure, with three positions for each vertex, where two are reserved for the coordinate values and one for a common pointer, and two times two positions for each edge: for the index of the vertex at the other end of the edge, or for the slope of the edge, and for a common pointer. For concrete inputs of N points, positions actually not needed in the resulting (18N-33)-tuple contain value zero by assumption. Then, we have

$$\lambda_{\text{VORONOI\_DIAGRAM}} = \gamma_{\text{VORONOI\_DIAGRAM}} = 2N, \quad\text{and}$$

$$12N-30 \le \tau_{\text{VORONOI\_DIAGRAM}} \le 2N(18N-33),$$

for $N \ge 3$, where the local and global case may be analyzed by using a regular N-gon, and for the total case a Voronoi diagram in the sense of Fig. 3, with 2N-5 points, was used where each point of the diagram essentially depends on three input points, i.e., on six coordinate values.
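The recursive equations for v(N) and e(N) and the lower bound on the total dependence can be checked mechanically. This is a small sketch; the function name is an assumption for illustration.

```python
def max_voronoi_counts(n):
    """Iterate v(N+1) = v(N) + 2 and e(N+1) = e(N) + 3 from v(3) = 1, e(3) = 3."""
    v, e = 1, 3
    for _ in range(3, n):
        v, e = v + 2, e + 3
    return v, e

for n in range(3, 50):
    v, e = max_voronoi_counts(n)
    # closed forms stated in the text
    assert (v, e) == (2 * n - 5, 3 * n - 6)
    # six coordinate values per diagram point give the lower bound 12N - 30
    assert 6 * v == 12 * n - 30
```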

Example 11. Matching of a pattern of length M against a string of length N ($M \le N$, and the elements of pattern and string are assumed to be reals) may be considered as an (N+M)-ary function into the set of (N-M+1)-tuples on {0,1} where, for

$$f_{\text{PATTERN\_MATCHING}}(p_1,p_2,\ldots,p_M;\ s_1,s_2,\ldots,s_N) = (e_1,e_2,\ldots,e_{N-M+1}),$$

we have $e_i = 1$ iff $s_{i+j} = p_{j+1}$ for all $j = 0,1,\ldots,M-1$, and $e_i = 0$ otherwise, for $i = 1,2,\ldots,N-M+1$. We have

$$\lambda_{\text{PATTERN\_MATCHING}} = 2M,$$

$$\gamma_{\text{PATTERN\_MATCHING}} = M+N, \quad\text{and}$$

$$\tau_{\text{PATTERN\_MATCHING}} = 2M(N-M+1).$$
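These three values can be reproduced by brute force for small M and N. The sketch below assumes that an output position depends on an input position if changing that single input changes the output, and it uses a constant pattern and string as the input situation; the helper names and the two test values 1.0 and 2.0 are assumptions for illustration.

```python
def pattern_matching(p, s):
    """e_i = 1 iff the pattern p occurs in s starting at position i."""
    m, n = len(p), len(s)
    return [int(all(s[i + j] == p[j] for j in range(m))) for i in range(n - m + 1)]

def dependence_counts(m, n):
    """Return (local, global, total) dependence for constant pattern and string."""
    p, s = [1.0] * m, [1.0] * n
    base = pattern_matching(p, s)
    sub = [set() for _ in base]  # sub[i]: input positions that output i depends on
    for pos in range(m + n):
        q, t = list(p), list(s)
        if pos < m:
            q[pos] = 2.0          # perturb one pattern element
        else:
            t[pos - m] = 2.0      # perturb one string element
        for i, (a, b) in enumerate(zip(base, pattern_matching(q, t))):
            if a != b:
                sub[i].add(pos)
    return max(map(len, sub)), len(set().union(*sub)), sum(map(len, sub))
```

For M = 3 and N = 7 this returns (6, 10, 30), that is, 2M, M+N and 2M(N-M+1).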

In all three cases, the maximal dependence may be analyzed for the trivial input situation $p_i = s_j = \text{const}$, for $i = 1,2,\ldots,M$ and $j = 1,2,\ldots,N$. Detection of a pattern of length M within a string of length N, $M \le N$, may be considered as an (N+M)-ary function into the set {0,1} where the output is equal to

$$\max\{e_i : i = 1,2,\ldots,N-M+1 \ \&\ f_{\text{PATTERN\_MATCHING}}(p_1,p_2,\ldots,p_M;\ s_1,s_2,\ldots,s_N) = (e_1,e_2,\ldots,e_{N-M+1})\},$$

for input $(p_1,p_2,\ldots,p_M;\ s_1,s_2,\ldots,s_N)$. Then,

$$\max\{2M,\ M+\lfloor N/M \rfloor\} \le \lambda_{\text{PATTERN\_SIGNALIZATION}} \le M+N.$$

Note that this represents the first example of a computational problem where the equality $\gamma_f = n$ remains an open problem, for an n-ary function f with n=N+M in the case of pattern detection. As a last example, sorting of N real numbers may be considered as an N-ary function into the set of N-tuples of real numbers. For this very important problem, we have

$$\lambda_{\text{SORTING}} = \gamma_{\text{SORTING}} = N, \quad\text{and}\quad \tau_{\text{SORTING}} = N^2,$$

where these maximal values are true for N pairwise different input values.
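The sorting values can be illustrated in the same brute-force manner. The sketch assumes that an output position depends on an input position if changing that single input can change the output, and it tries only two replacement values per position (one below the minimum, one above the maximum); the function name and this choice of trial values are assumptions for illustration.

```python
def sorting_dependence(xs):
    """Return (local, global, total) dependence of sorting at the input xs."""
    base = sorted(xs)
    lo, hi = min(xs) - 1, max(xs) + 1
    sub = [set() for _ in xs]  # sub[i]: input positions that output i depends on
    for j in range(len(xs)):
        for trial in (lo, hi):
            ys = list(xs)
            ys[j] = trial
            for i, (a, b) in enumerate(zip(base, sorted(ys))):
                if a != b:
                    sub[i].add(j)
    return max(map(len, sub)), len(set().union(*sub)), sum(map(len, sub))
```

For N pairwise different values, moving one item below the minimum shifts every output up to its rank, and moving it above the maximum shifts every output from its rank onward, so each output depends on all N inputs.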


4.  Data transfer lemma and applications

Between the quantitative descriptions of data transfer for SIMD systems (Section 2) and of data dependence for computational problems (Section 3), the following direct relation holds.

Lemma 1. (Data Transfer Lemma). Let $SYS \in SIMD$, and let $\pi$ be an arbitrary program for SYS for the computation of a function f which is n-ary and has m-tuple values. Let R denote the set of output registers of SYS where the m-tuples appear at the end of the computation (card(R) = m, off-line mode), or those output registers of SYS via which the computed values of the m-tuples leave SYS in certain waves of information ($\mathrm{card}(R) \le m$, on-line mode). Then, the computation of $f(x_1,x_2,\ldots,x_n)$ on SYS by $\pi$ requires at least $t_0$ steps of computation for a given input $(x_1,x_2,\ldots,x_n) \in \mathrm{domain}(f)$, where $\Lambda_{SYS}(t_0) \ge \lambda_f$, $\Gamma_{SYS}(\mathrm{card}(R),t_0) \ge \gamma_f$, and $T_{SYS}(\mathrm{card}(R),t_0) \ge \tau_f$.


Proof. Let us consider the local off-line or on-line situation. Assume that $\lambda_f = \mathrm{card}(\mathrm{sub}_i(x_1,x_2,\ldots,x_n))$, for a given input vector $(x_1,x_2,\ldots,x_n)$, and for a given position i, $1 \le i \le m$. Let $\mathrm{sub}_i(x_1,x_2,\ldots,x_n) = \{j_1,j_2,\ldots,j_{\lambda_f}\}$. For any position $j_k$, $k = 1,2,\ldots,\lambda_f$, either the name of an input register receiving value $x_{j_k}$ at a given moment will be transferred to the receptive field $\mathrm{rec}_{\pi}^{(x_1,x_2,\ldots,x_n)}(r^{(i)},t^*)$ by some operational instructions only, if value $\mathrm{proj}_i(f(x_1,x_2,\ldots,x_n))$ appears in register $r^{(i)} \in R$ at time $t^* \le t_0$ of computation, or during the $t^*$ steps of computation of $\mathrm{proj}_i(f(x_1,x_2,\ldots,x_n))$ at least one test instruction JGTZ, JZERO, or JLTZ must be performed where the contents of the CPU accumulator depend on the input value $x_{j_k}$ at the moment of testing. In the second case, if the test instruction is followed by certain operational instructions directed to register $r^{(i)}$, the name of the input register receiving value $x_{j_k}$ at a given moment will be transferred to the receptive field $\mathrm{rec}_{\pi}^{(x_1,x_2,\ldots,x_n)}(r^{(i)},t^*)$, too; cp. (iv) in Definition 4. Without loss of generality, assume that $j_1,j_2,\ldots,j_v$, $v \le \lambda_f$, denote all the positions which have produced register names in the receptive field $\mathrm{rec}_{\pi}^{(x_1,x_2,\ldots,x_n)}(r^{(i)},t^*)$. If $v = \lambda_f$, then $\lambda_f \le \mathrm{card}(\mathrm{rec}_{\pi}^{(x_1,x_2,\ldots,x_n)}(r^{(i)},t^*)) \le \Lambda_{SYS}(t_0)$ follows immediately. For $v < \lambda_f$, let $t_1,t_2,\ldots,t_w$ be all the moments where test instructions have to be performed according to $\pi$ and input $(x_1,x_2,\ldots,x_n)$ such that the contents of the CPU accumulator depend on one of the input values $x_{j_{v+1}},\ldots,x_{j_{\lambda_f}}$ at least, at the moments of testing. Consider the following program $\pi'$ computing something unspecified, produced by $\pi$ and $(x_1,x_2,\ldots,x_n)$ in the following way:

- all test instructions at moments $t_1,t_2,\ldots,t_w$ will be deleted in $\pi$, and

- all other instructions of $\pi$ will be performed according to $\pi$ and input $(x_1,x_2,\ldots,x_n)$, in the same order, where all instructions LOAD a or $OP_1$ a, for a equal to =x, m, *m, or (j), will be replaced by $OP_2$ a, for the same value of a, if such instructions appear in $\pi$.

Thus, the receptive field of register 0, i.e., the CPU accumulator, will increase monotonically according to $\pi'$ and $(x_1,x_2,\ldots,x_n)$. After $t^*-w$ operations according to $\pi'$, $\mathrm{rec}_{\pi'}^{(x_1,x_2,\ldots,x_n)}(0,t^*-w)$ contains all input register names for the input data $x_{j_{v+1}},\ldots,x_{j_{\lambda_f}}$. This receptive field will be combined with $\mathrm{rec}_{\pi'}^{(x_1,x_2,\ldots,x_n)}(r^{(i)},t^*-w) \supseteq \mathrm{rec}_{\pi}^{(x_1,x_2,\ldots,x_n)}(r^{(i)},t^*)$ at moment $t^*-w+1 \le t^*$ by adding an instruction $OP_2$ a (see conditions (OFF.2) and (ON.6)) or $OP_2$ (j) (see conditions (OFF.4) and (ON.7)) to $\pi'$. Thus, $\lambda_f \le \mathrm{card}(\mathrm{rec}_{\pi'}^{(x_1,x_2,\ldots,x_n)}(0,t^*-w+1)) \le \Lambda_{SYS}(t^*-w+1) \le \Lambda_{SYS}(t_0)$. Note that the off-line or on-line I/O convention is necessary to ensure that a non-accumulator PE register $r^{(i)}$ may be replaced by the accumulator of the same PE which is an output register, too. For this replacement, parallel STORE instructions may be replaced by parallel $OP_1$ instructions using the same masks for PE addresses. What we have explained is one of the possible ways to ensure the necessary data transfer within time limit $t_0$, for the local off-line or on-line situation. The essential point in the program transformation from $\pi$ to $\pi'$ may be characterized by the word "linearization," because all test instructions could be deleted, in fact. This linearization approach may be used for the local, global and total situation in the following way.

For the given program $\pi$ and an input situation I, all the performed instructions will be written as a linear sequence $S_0$. We obtain sequence $S_1$ by deletion of all instructions JLTZ, JZERO, JGTZ, JUMP, WRITE, and HALT in sequence $S_0$. Now, for the special case of an on-line program, if in sequence $S_0$ there were some STORE instructions in front of a WRITE instruction directed to certain output registers $r \in R$, then these STORE instructions will be shifted to the end of sequence $S_1$. In the resulting sequence $S_2$, all serial or parallel $OP_1$ a or LOAD a instructions will be replaced by an $OP_2$ a instruction formally, in the same position for the same value of a. For the resulting sequence $S_3$ we have monotonically increasing receptive fields for all accumulators, for the CPU and PEs. Also, by the described step from $S_1$ to $S_2$, for sequence $S_3$ the receptive fields of output registers will be monotonically increasing for consecutive output waves of information. Now, if in the original sequence $S_0$ there was no test instruction, our program linearization is finished. In the other case, in $S_3$ we shall place an instruction JZERO, e.g., in that position where the last test instruction was located in sequence $S_0$. Now consider an arbitrary output register $r \in R$. If there is an operational instruction behind the JZERO instruction directed to r then register r will obtain the receptive field of the CPU accumulator containing all the register names corresponding to tested input values, cp. (iv) in Definition 4. If there is no operational instruction behind the JZERO instruction directed to r then we shift the last instruction directed to r in front of the JZERO instruction to a position behind this instruction.

By consideration of all registers $r \in R$, our program linearization is finished. Note that the length of the resulting linear instruction sequence is restricted by the length of the original sequence $S_0$.
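Two of the linearization steps, the deletion of control instructions and the replacement of LOAD and $OP_1$ by $OP_2$, can be sketched as a transformation of an executed instruction trace. The encoding of instructions as (opcode, operand) pairs and the opcode strings "OP1" and "OP2" are hypothetical; the shifting of STORE instructions for on-line programs and the re-insertion of a single JZERO are deliberately left out of this sketch.

```python
# Control and output instructions deleted in the step from S0 to S1.
DELETED = {"JLTZ", "JZERO", "JGTZ", "JUMP", "WRITE", "HALT"}

def linearize(trace):
    """Map an executed trace S0 of (opcode, operand) pairs to a linear sequence.

    Only the deletion step and the OP1/LOAD -> OP2 replacement are modelled.
    """
    s1 = [ins for ins in trace if ins[0] not in DELETED]
    return [("OP2", a) if op in ("OP1", "LOAD") else (op, a) for op, a in s1]

trace = [("LOAD", "1"), ("JZERO", "5"), ("OP1", "*2"), ("STORE", "3"), ("HALT", None)]
linear = linearize(trace)
```

Because instructions are only deleted or replaced one for one, the result is never longer than the original trace, in line with the length remark above.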

Now assume that $\lambda_f = \mathrm{card}(\mathrm{sub}_i(x_1,x_2,\ldots,x_n))$ for a certain i, $1 \le i \le m$, $\gamma_f = \mathrm{card}(\bigcup_{i=1}^{m} \mathrm{sub}_i(y_1,y_2,\ldots,y_n))$ and $\tau_f = \sum_{i=1}^{m} \mathrm{card}(\mathrm{sub}_i(z_1,z_2,\ldots,z_n))$, for certain input vectors $(x_1,x_2,\ldots,x_n)$, $(y_1,y_2,\ldots,y_n)$, $(z_1,z_2,\ldots,z_n)$. These input vectors characterize input situations $I_x$, $I_y$, $I_z$ for SYS. By linearization of $\pi$ according to these input situations we obtain linear programs $\pi_x$, $\pi_y$, $\pi_z$, respectively, all of length $\le t_0$. Thus, we have

$$\Lambda_{\pi_x}^{(x_1,x_2,\ldots,x_n)}(R,t_0) \ge \lambda_f,$$

$$\Gamma_{\pi_y}^{(y_1,y_2,\ldots,y_n)}(R,t_0) \ge \gamma_f,$$

$$T_{\pi_z}^{(z_1,z_2,\ldots,z_n)}(R,t_0) \ge \tau_f,$$

which proves our statements. □


Corollary 1. Let $CLASS \subseteq SIMD$. For any system $SYS \in CLASS$, the computation of a function f which is into the set of m-tuples of real numbers requires at least $t_0$ steps of computation in the worst case, where $\Lambda_{CLASS}(t_0) \ge \lambda_f$, $\Gamma_{CLASS}(m,t_0) \ge \gamma_f$, and $T_{CLASS}(m,t_0) \ge \tau_f$.

Proof. Immediately by Lemma 1 where the generalization about all programs computing the function f is used as well as about all systems of CLASS. For the on-line case note that there may already be a certain $m_0 \le m$ such that $\Gamma_{CLASS}(m_0,t_0) \ge \gamma_f$, and $T_{CLASS}(m_0,t_0) \ge \tau_f$. □

Example 12. Let CLASS = {EXAMP1} and consider the computation of the parallel Roberts gradient as described in Example 1. In this case we get the trivial lower time bound 1 only; an upper bound was 29. Now, let CLASS = {EXAMP3} and consider the computation of the arithmetical averages of M consecutive waves of information of length $N = 2^n$ as described in Example 3. Here by Corollary 1 we obtain the lower time bound $n+2M-2 = \max\{n-1,\ n+2M-2,\ n+M-1\}$, cp. equations (6.1), (6.2), (6.3), for values $\lambda_f = N$, $\gamma_f = N \cdot M$ and $\tau_f = [\ldots]$. An upper bound was 6M+n.

Using common asymptotic notations, for both examples the optimal times O(1) and O(M+n) are known as a result.

Theorem 3. For any system $SYS \in \text{OFF-NET}_p$, $p \ge 2$, the computation of a function f which is into the set of m-tuples of real numbers requires at least $t_0$ steps of computation in the worst case, where

$$t_0 \ge \max\{(d_1-1)/2,\ (d_2-m)/2m,\ (d_3-m)/2m\}$$

for p=2, and for $p \ge 3$

$$t_0 \ge \max\{\log_{p-1}(d_1(p-2)+2) - 1.586,$$
$$\log_{p-1}(d_2(p-2)+2) - \log_{p-1} m - 1.586,$$
$$\log_{p-1}(d_3(p-2)+2) - \log_{p-1} m - 1.586\},$$

if f is locally $d_1$-dependent, globally $d_2$-dependent, and totally $d_3$-dependent.

Proof. Immediately by Theorem 1, Definition 7 and Corollary 1 where the relation $\log_{p-1} p \le 1.586$, $p \ge 3$, was used. □
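The bound of Theorem 3 is easy to evaluate numerically. The following sketch transcribes the two cases as stated; the function name and argument order are assumptions for illustration, and the sorting values $d_1 = d_2 = N$, $d_3 = N^2$, $m = N$ from Example 11 are used as a sample input.

```python
import math

def theorem3_bound(p, m, d1, d2, d3):
    """Lower time bound for OFF-NET_p; d1, d2, d3 are the local, global, total dependence."""
    if p == 2:
        return max((d1 - 1) / 2, (d2 - m) / (2 * m), (d3 - m) / (2 * m))

    def log(x):
        return math.log(x, p - 1)

    return max(
        log(d1 * (p - 2) + 2) - 1.586,
        log(d2 * (p - 2) + 2) - log(m) - 1.586,
        log(d3 * (p - 2) + 2) - log(m) - 1.586,
    )

n_items = 128
linear = theorem3_bound(2, n_items, n_items, n_items, n_items ** 2)
degree3 = theorem3_bound(3, n_items, n_items, n_items, n_items ** 2)
```

For N = 128 the linear case (p = 2) gives 63.5, so at least 64 steps, while p = 3 gives a value between 5 and 6, so at least 6 steps.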

In  Table  7  are  collected,  for  the  classes  of  off-line  systems 
defined  in  Section  1,  the  lower  time  bounds  that  may  be  obtained 
by  using  Corollary  1.  Because  the  classes  OFF-LINEAR,  OFF-PS, 
OFF-BINTREE  and  OFF-QUADTREE  represent  examples  for  the  maximal 
transfer  situation  as  characterized  by  Theorem  1,  for  these 
classes  the  lower  time  bounds  are  as  given  by  Theorem  3.  If  a 
function  f  into  the  set  of  m-tuples  is  globally  or  totally  d'- 
dependent,  then  the  value  d  has  to  be  replaced  by  d'/m  in  the 


lower  time  bounds  given  in  Table  7,  to  obtain  the  corresponding 
values  for  the  global  or  total  situation. 
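The local entries of such a table can be recomputed by searching for the smallest t at which a transfer function reaches d. The sketch below transcribes the off-line local data transfer functions tabulated in the appendix; SQUARE is taken as $2t^2+2t+1$, which matches its tabulated values 41 and 145. The dictionary and function names are assumptions for illustration.

```python
# Off-line local data transfer functions of t (transcribed, see lead-in).
TRANSFER = {
    "LINEAR": lambda t: 2 * t + 1,
    "HEXAGONAL": lambda t: (3 * t * t + 3 * t + 2) // 2,
    "SQUARE": lambda t: 2 * t * t + 2 * t + 1,
    "TRIAGONAL": lambda t: 3 * t * t + 3 * t + 1,
    "DIAGONAL": lambda t: (2 * t + 1) ** 2,
    "PS": lambda t: 3 * 2 ** t - 2,
    "BINTREE": lambda t: 3 * 2 ** t - 2,
    "TRIANGLE": lambda t: 3 * 2 ** (t + 1) + t * t - 2 * t - 5,
    "QUADTREE": lambda t: (5 * 4 ** t - 2) // 3,
}

def min_steps(transfer, d):
    """Smallest t with transfer(t) >= d; transfer is assumed non-decreasing."""
    t = 0
    while transfer(t) < d:
        t += 1
    return t

bounds = {name: (min_steps(f, 128), min_steps(f, 128 ** 2)) for name, f in TRANSFER.items()}
```

For d = 128 and d = 128^2 this yields, for example, (64, 8192) for LINEAR, (8, 91) for SQUARE and (4, 7) for QUADTREE.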

Theorem 4. For any system $SYS \in \text{ON-NET}_{p,q}$, $2 \le p < \infty$, $1 \le q < p$, the computation of a function f which is into the set of m-tuples of real numbers requires at least $t_0$ steps of computation in the worst case, where

$$t_0 \ge \max\{(d_1+1)/2,\ (d_2+m)/2m,\ (d_3+m)/2m\}$$

for q=1, and for $q \ge 2$

$$t_0 \ge \max\{\log_q(d_1(q-1)+1),\ \log_q(d_2(q-1)/m + 1),\ \log_q(d_3(q-1)/m + 1)\},$$

if f is locally $d_1$-dependent, globally $d_2$-dependent, and totally $d_3$-dependent.

Proof. Immediately by Theorem 2, Definition 7 and Corollary 1. □
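The on-line bound of Theorem 4 can be evaluated in the same way. This is a sketch transcribing the two cases as stated; the function name and argument order are assumptions for illustration.

```python
import math

def theorem4_bound(q, m, d1, d2, d3):
    """Lower time bound for ON-NET_{p,q}; d1, d2, d3 are the local, global, total dependence."""
    if q == 1:
        return max((d1 + 1) / 2, (d2 + m) / (2 * m), (d3 + m) / (2 * m))
    return max(
        math.log(d1 * (q - 1) + 1, q),
        math.log(d2 * (q - 1) / m + 1, q),
        math.log(d3 * (q - 1) / m + 1, q),
    )
```

With m = 1 and d = 128 in all three positions, q = 1 gives 64.5 (at least 65 steps), q = 2 gives $\log_2 129$ (at least 8 steps), and q = 4 gives $\log_4 385$ (at least 5 steps).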

In Table 8 are collected, for the classes of on-line systems defined in Section 1, the lower time bounds that may be obtained by using Corollary 1. Because the classes $\text{ON-LINEAR}_{\{0\}}$, $\text{ON-BINTREE}_{\{1,2\}}$, and $\text{ON-QUADTREE}_{\{1,2,3,4\}}$ represent examples for maximal transfer situations as characterized by Theorem 2, for these classes the lower time bounds are as stated by Theorem 4.

As in the case of Table 7, if a function f into the set of m-tuples is globally or totally d'-dependent, then the value d has to be replaced by d'/m in the lower time bounds given in Table 8, for obtaining the corresponding values for the global or total situation. Note that value m may be replaced by a value $m_0 \le m$ for special ON-NET systems.


5.  Conclusions

In  this  paper  we  have  given  a  general  framework  for  the 
description  of  parallel  processing  systems,  and  explained 
how  data  flow  may  be  used  for  analyzing  lower  time  bounds  in 
general.  Note  that  this  approach  may  be  applied  to  supercompu¬ 
ters  as  well  as  to  on-chip  realizations.  Problems  connected 
with  the  technical  features  of  architecture  elements  were  by¬ 
passed  by  the  selected  level  of  abstract  system  description. 
Thus, in the discussion of parallel algorithms for a given model $SYS \in SIMD$ we may have in mind quite different technical implementations, but we may discuss parallel algorithms for all of them at once using the abstract model $SYS \in SIMD$. For example, an important problem is given by the necessary decision between different structures of parallel processing systems to ensure efficient algorithmic solutions for classes of computational problems such as mentioned in Example 8 (matrix-type computations), 9 (two-dimensional transforms), 10 (geometric problems), or 11 (combinatorial problems). According to our considerations in [4] the selection of parallel algorithms crucially depends on the given parallel processing system, and comparisons between different SIMD systems on the basis of knowledge about optimal algorithms represent quite a hard task.
Also,  there  are  nearly  as  many  different  models  for  parallel 
processing  as  papers  on  this  topic,  making  comparative  studies 
of  different  parallel  structures  nearly  impossible.  In  the 
present  paper  an  attempt  was  made  to  propose  a  classification 


of  special  parallel  processing  systems  which  have  been  of  wide¬ 
spread  interest  in  the  past.  The  proof  of  the  practicability 
of  the  proposed  exact  definition  of  SIMD  systems  will  be  the 
subject  of  forthcoming  papers;  the  first  programs  of  the  PARSIS 
project  fit  well  into  this  framework. 

By using Tables 6, 7, and 8 the interested reader may obtain lower time bounds for different combinations of SIMD systems and computational problems, e.g., the lower time bound $\log_2(N^2+1)$ for the two-dimensional Walsh transform on ON-TRIANGLE systems. The characterization of data dependencies for computational problems as given by Definition 7 may be refined, e.g., by consideration of changes of function values not only by changing arguments in one position but in several positions.
Uniform networks

Instruction                         Changes of receptive fields

[mask] OP_1 m                       rec((j,0),t+1) = rec((j,m),t)
[mask] OP_1 *m                      rec((j,0),t+1) = rec((j,m),t) ∪ rec((j,c(j,m)),t)
[mask] OP_1 :i                      rec((j,0),t+1) = rec((f_i(j),0),t)
[mask] OP_2 m                       rec((j,0),t+1) = rec((j,0),t) ∪ rec((j,m),t)
[mask] OP_2 *m                      rec((j,0),t+1) = rec((j,0),t) ∪ rec((j,m),t) ∪ rec((j,c(j,m)),t)
[mask] OP_{l+1} :i_1,i_2,...,i_l    rec((j,0),t+1) = rec((j,0),t) ∪ rec((f_{i_1}(j),0),t) ∪ rec((f_{i_2}(j),0),t) ∪ ... ∪ rec((f_{i_l}(j),0),t)
[mask] STORE m                      rec((j,m),t+1) = rec((j,0),t)
[mask] STORE *m                     rec((j,c(j,m)),t+1) = rec((j,0),t) ∪ rec((j,m),t)
[mask] STORE :i_1,i_2,...,i_l       rec((f_{i_1}(j),0),t+1) = rec((j,0),t), rec((f_{i_2}(j),0),t+1) = rec((j,0),t), ..., rec((f_{i_l}(j),0),t+1) = rec((j,0),t)
[mask] READ m                       rec((j,m),t+1) = {(j,m)^(WN)}
[mask] READ *m                      rec((j,c(j,m)),t+1) = rec((j,m),t) ∪ {(j,c(j,m))^(WN)}

OP_1 =x                             rec(0,t+1) = {0^(WN)}
OP_1 m                              rec(0,t+1) = rec(m,t)
OP_1 *m                             rec(0,t+1) = rec(m,t) ∪ rec(c(m),t)
OP_1 (j)                            rec(0,t+1) = rec((j,0),t)
OP_2 =x                             rec(0,t+1) = rec(0,t) ∪ {0^(WN)}
OP_2 m                              rec(0,t+1) = rec(0,t) ∪ rec(m,t)
OP_2 *m                             rec(0,t+1) = rec(0,t) ∪ rec(m,t) ∪ rec(c(m),t)
OP_2 (j)                            rec(0,t+1) = rec(0,t) ∪ rec((j,0),t)
STORE m                             rec(m,t+1) = rec(0,t)
STORE *m                            rec(c(m),t+1) = rec(0,t) ∪ rec(m,t)
STORE (j)                           rec((j,0),t+1) = rec(0,t)
READ m                              rec(m,t+1) = {m^(WN)}
READ *m                             rec(c(m),t+1) = rec(m,t) ∪ {c(m)^(WN)}

Table 3.  Changes of receptive fields in step t+1
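The update rules for the CPU instructions with a direct address m can be simulated with sets. In this sketch a receptive field is a Python set of register names, register 0 is the accumulator, and the wave-number tag (WN) is omitted; the function name and the opcode strings are assumptions for illustration, and indirect addressing and PE instructions are not modelled.

```python
def step(rec, op, m):
    """Return the receptive fields after one CPU instruction with direct address m."""
    new = {r: set(s) for r, s in rec.items()}
    acc, cell = rec.get(0, set()), rec.get(m, set())
    if op == "OP1":
        new[0] = set(cell)        # accumulator is overwritten
    elif op == "OP2":
        new[0] = acc | cell       # accumulator field grows
    elif op == "STORE":
        new[m] = set(acc)
    elif op == "READ":
        new[m] = {m}              # fresh input value in register m
    else:
        raise ValueError(op)
    return new

rec = {}
for op, m in [("READ", 1), ("READ", 2), ("OP1", 1), ("OP2", 2), ("STORE", 3)]:
    rec = step(rec, op, m)
```

After this sequence the accumulator and register 3 both carry the field {1, 2}, which shows how a binary operation merges receptive fields while a unary one replaces them.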


CLASS               p    Λ_OFF-CLASS(t)               t=4      t=8

LINEAR              2    2t+1                           9       17
HEXAGONAL           3    (3/2)t^2 + (3/2)t + 1         31      109
SQUARE or ILLIAC    4    2t^2 + 2t + 1                 41      145
TRIAGONAL           6    3t^2 + 3t + 1                 61      217
DIAGONAL            8    4t^2 + 4t + 1                 81      289
PS                  3    3·2^t - 2                     46      766
BINTREE             3    3·2^t - 2                     46      766
  top node          -    2^(t+1) - 1                   31      511
TRIANGLE            5    3·2^(t+1) + t^2 - 2t - 5      99    1,579
  top node          -    2^(t+1) - 1                   31      511
QUADTREE            5    (5·4^t - 2)/3                426  109,226
  top node          -    (4^(t+1) - 1)/3              341   87,381

CLASS               p    {i_1,i_2,...,i_q}      Λ_ON-CLASS(t)                                        t=4     t=8

LINEAR              2    {0}                    2t-1                                                   7      15
HEXAGONAL           3    {0,1}                  t(t+1)/2                                              10      36
                         {0}                    2t-1                                                   7      15
SQUARE or ILLIAC    4    {0,1,2}                t^2                                                   16      64
                         {0,2}                  t(t+1)/2                                              10      36
                         {0,1}, {0}             2t-1                                                   7      15
TRIAGONAL           6    {0,1,2,3,4}            [...]                                                 31     121
                         {0,2,3,4}              (3/2)t^2 - (1/2)t                                     22      92
                         {0,2,4}                t^2                                                   16      64
DIAGONAL            8    {0,1,2,3,4,6,7}        (7/2)t^2 - (7/2)t + 1                                 43     197
BINTREE             3    {1,2}                  2^t - 1                                               15     255
                         {0,1}                  t(t+1)/2                                              10      36
TRIANGLE            5    {1,2,3,4}              2^t - 1                                               15     255
QUADTREE            5    {1,2,3,4}              (4^t - 1)/3                                           85  21,845
PS                  3    {0,1}                  [(1+√5)^(t+3) - (1-√5)^(t+3)] / (√5·2^(t+3)) - 2      11      87

Table 5.  General local data transfer functions for on-line systems
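The closed-form entry for the PS class is the Binet expression for a Fibonacci number, reduced by 2. The sketch below compares it with an integer recurrence using $f_1 = f_2 = 1$; the function names and the indexing convention are assumptions for illustration.

```python
import math

def fib(k):
    """Fibonacci numbers with fib(1) = fib(2) = 1."""
    a, b = 0, 1
    for _ in range(k):
        a, b = b, a + b
    return a

def ps_online_transfer(t):
    """[(1+sqrt5)^(t+3) - (1-sqrt5)^(t+3)] / (sqrt5 * 2^(t+3)) - 2, rounded to an integer."""
    s5 = math.sqrt(5)
    k = t + 3
    return round(((1 + s5) ** k - (1 - s5) ** k) / (s5 * 2 ** k)) - 2
```

The two tabulated values are reproduced: t = 4 gives 13 - 2 = 11 and t = 8 gives 89 - 2 = 87.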


Computational problem f    n        m        λ_f                             γ_f      τ_f

MATRIX MULTIPLICATION      2N^2     N^2      2N                              2N^2     2N^3
MATRIX INVERSION IP        N^2      N^2      N^2                             N^2      N^4
DETERMINANT                N^2      1        N^2                             N^2      N^2
LINEAR EQUATIONS           N^2+N    N        N^2+N                           N^2+N    N^3+N^2
TRANSPOSITION IP           N^2      N^2      1                               N^2      N^2
MATRIX π IP                N^2      N^2      2 for π≠id                      N^2      2N^2 - #{(i,j) : π(i,j) = (i,j)}
2D-DFT                     2N^2     2N^2     ≥ 2N^2-4, ≤ 2N^2-1              2N^2     ≥ 2N^4, ≤ 4N^4-2N^2
2D-WT                      N^2      N^2      N^2                             N^2      N^4
ROBERTS GRADIENT           MN       MN       4                               MN       4MN-2M-2N-2
CH SIPOL                   2N       2N       2N                              2N       ≥ 2N^2-8N+12, ≤ 4N^2
VORONOI DIAGRAM            2N       18N-33   2N                              2N       ≥ 12N-30, ≤ 36N^2-66N
PATTERN MATCHING           N+M      N-M+1    2M                              M+N      2M(N-M+1)
PATTERN SIGNALIZATION      N+M      1        ≥ max{2M, M+⌊N/M⌋}, ≤ M+N       = λ_f    = λ_f
SORTING                    N        N        N                               N        N^2

CLASS               p    lower time bound                            d=128    d=128^2

LINEAR              2    (d-1)/2                                       64      8,192
HEXAGONAL           3    (((8/3)d - 5/3)^(1/2) - 1)/2                   9        105
SQUARE or ILLIAC    4    ((2d-1)^(1/2) - 1)/2                           8         91
TRIAGONAL           6    (((4/3)d - 1/3)^(1/2) - 1)/2                   7         74
DIAGONAL            8    (d^(1/2) - 1)/2                                6         64
PS                  3    log_2(d+2) - 1.586                             6         13
BINTREE             3    log_2(d+2) - 1.586                             6         13
  top node          -    log_2(d+1) - 1                                 7         14
TRIANGLE            5    t_0 ≥ log_2(d - t_0^2 + 2t_0 + 5) - 2.586      5         12
  top node          -    log_2(d+1) - 1                                 7         14
QUADTREE            5    log_4(3d+2) - 1.161                            4          7
  top node          -    log_4(3d+1) - 1                                4          7

CLASS               p    {i_1,...,i_q}          Lower time bound                        d=128    d=128^2

LINEAR              2    {0}                    (d+1)/2                                   65      8,193
HEXAGONAL           3    {0,1}                  ((8d+1)^(1/2) - 1)/2                      16        181
SQUARE or ILLIAC    4    {0,1,2}                d^(1/2)                                   12        128
TRIAGONAL           6    {0,1,2,3,4}            (([...]d - [...])^(1/2) - 1)/2             7         81
DIAGONAL            8    {0,1,2,3,4,6,7}        (([...]d - [...])^(1/2) - 1)/2             6         64
BINTREE             3    {1,2}                  log_2(d+1)                                 8         15
TRIANGLE            5    {1,2,3,4}              log_2(d+1)                                 8         15
QUADTREE            5    {1,2,3,4}              log_4(3d+1)                                5          8
PS                  3    {0,1}                  f_(t_0+2) ≥ d+2 for the Fibonacci         11         21
                                                numbers f_0, f_1, f_2, ...

Table 8.  Lower time bounds for on-line systems in ON-CLASS for computing a local d-dependent function.